Unable to decode blob metadata into Azure Search index - azure-blob-storage

I'm using an indexer to index PDF files into an Azure Search index. I have some metadata parameters encoded as URL-safe base64 (document_url in the screenshot).
Everything works fine: the indexer runs, and document_url is decoded and indexed into a Url property.
The problem comes when I try to do the same for a metadata_title parameter. It is configured in exactly the same way as 'document_url', but when the indexer runs it throws an error:
Message: Could not parse document. Could not apply mapping function 'base64Decode' to field 'metadata_title'.
Details: The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters.
The metadata_title has the following value: Tm9uLVNtb2tlcsKQcyBEZWNsYXJhdGlvbg
Using an external tool I'm able to decode that without issues.
The mapping configurations for 'document_url' and 'metadata_title' are the same:
{
    "sourceFieldName": "metadata_title",
    "targetFieldName": "MetaTitle",
    "mappingFunction": {
        "name": "base64Decode",
        "parameters": {
            "useHttpServerUtilityUrlTokenDecode": false
        }
    }
}
Even if I remove the 'metadata_title' property from the blob, it keeps throwing the same error.
Maybe the problem is in the Index?
This is where the metadata_title is being mapped:
{
    "name": "MetaTitle",
    "type": "Edm.String",
    "searchable": true,
    "filterable": false,
    "retrievable": true,
    "sortable": false,
    "facetable": false,
    "key": false,
    "indexAnalyzer": null,
    "searchAnalyzer": null,
    "analyzer": null,
    "normalizer": null,
    "synonymMaps": []
}
This property is searchable, while Url (the property 'document_url' is mapped to) is not. That's the only difference I can see.

'metadata_title' is one of the reserved metadata properties for PDFs when using the blob indexer with "parsingMode": "default" and "dataToExtract": "contentAndMetadata", so I suspect the property being decoded is the one that ACS extracted from your document (which is not base64 encoded) rather than the one from your blob's metadata. To get around this, you would need to rename the blob metadata field to a name that is not one of the reserved properties that ACS extracts on your behalf. If that is not feasible, please open a support ticket so that one of our engineers can work with you on a more custom solution.
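As a minimal sketch of that workaround, assume the blob metadata property is renamed to a hypothetical custom_title that does not collide with the reserved names; the field mapping then points at the renamed property while still targeting the same index field:
{
    "sourceFieldName": "custom_title",
    "targetFieldName": "MetaTitle",
    "mappingFunction": {
        "name": "base64Decode",
        "parameters": {
            "useHttpServerUtilityUrlTokenDecode": false
        }
    }
}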

Related

Azure Data Factory REST API paging with Elasticsearch

While developing a pipeline that uses Elasticsearch as a source, I ran into an issue related to paging. I am using the Elasticsearch SQL API. I started by making the request in Postman, and it works well. The request body looks like the following:
{
    "query": "SELECT Id,name,ownership,modifiedDate FROM \"core\" ORDER BY Id",
    "fetch_size": 20,
    "cursor": ""
}
After the first run, the response body contains a cursor string, which is a pointer to the next page. If I send the request in Postman again and provide the cursor value from the previous response, it returns the data for the second page, and so on. I am trying to achieve the same result in Azure Data Factory. For this I am using a copy activity, which stores the response to an Azure blob. The setup for the source is the following:
copy activity source configuration
This is the expression for the body:
{
    "query": "SELECT Id,name,ownership,modifiedDate FROM \"#{variables('TableName')}\" WHERE ORDER BY Id",
    "fetch_size": #{variables('Rows')},
    "cursor": ""
}
I have no idea how to correctly set up the pagination rule. The pipeline works properly, but only for the first request. I've tried setting Headers.cursor with the expression $.cursor, but this setup leads to an infinite loop and the pipeline fails on an Elasticsearch restriction.
I've also tried to read the documentation at https://learn.microsoft.com/en-us/azure/data-factory/connector-rest#pagination-support, but it seems pretty limited in terms of usage examples and is difficult to understand.
Could somebody help me understand how to build the pipeline with paging?
The response with the cursor looks like:
{
    "columns": [
        {
            "name": "companyId",
            "type": "integer"
        },
        {
            "name": "name",
            "type": "text"
        },
        {
            "name": "ownership",
            "type": "keyword"
        },
        {
            "name": "modifiedDate",
            "type": "datetime"
        }
    ],
    "rows": [
        [
            2,
            "mic Inc.",
            "manufacture",
            "2021-03-31T12:57:51.000Z"
        ]
    ],
    "cursor": "g/WuAwFaAXNoRG5GMVpYSjVWR2hsYmtabGRHTm9BZ0FBQUFBRUp6VGxGbUpIZWxWaVMzcGhVWEJITUhkbmJsRlhlUzFtWjNjQUFBQUFCQ2MwNWhaaVIzcFZZa3Q2WVZGd1J6QjNaMjVSVjNrdFptZDP/////DwQBZgljb21wYW55SWQBCWNvbXBhbnlJZAEHaW50ZWdlcgAAAAFmBG5hbWUBBG5hbWUBBHRleHQAAAABZglvd25lcnNoaXABCW93bmVyc2hpcAEHa2V5d29yZAEAAAFmDG1vZGlmaWVkRGF0ZQEMbW9kaWZpZWREYXRlAQhkYXRldGltZQEAAAEP"
}
I finally found the solution; hopefully it will be useful for the community.
Basically, what needs to be done is to split the solution into the following steps.
Step 1: Make the first request as in the question description and stage the file to blob storage.
Step 2: Read the blob file, get the cursor value, and set it to a variable.
Step 3: Keep requesting data with a changed body:
{"cursor" : "#{variables('cursor')}" }
The pipeline looks like this:
pipeline
The configuration of pagination looks like the following:
pagination
It is a workaround, as the server ignores this header, but we need something that allows sending the request in a loop.
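In the copy activity's source JSON, the relevant pieces might look roughly like the sketch below. The RestSource and paginationRules property names come from the ADF REST connector; the request body and the Headers.cursor rule mirror what is described above (Elasticsearch ignores the header, it only keeps the loop going):
"source": {
    "type": "RestSource",
    "requestMethod": "POST",
    "requestBody": "{\"cursor\": \"#{variables('cursor')}\"}",
    "paginationRules": {
        "Headers.cursor": "$.cursor"
    }
}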

Logic Apps and Outlook embedded images

I want to copy embedded images from an Outlook email to my blob storage. I have a step that runs a For Each over the email attachments (the embedded images are recognized as attachments). But when I look at the raw input for the email message, attachment[0].contentBytes is Null. And the error I get when trying to create the blob by setting the blob content to "Attachments Content" is:
"InvalidTemplate. Unable to process template language expressions in action 'Create_blob' inputs at line '1' and column '3316': 'The template language function 'base64ToBinary' expects its parameter to be a string. The provided value is of type 'Null'. Please see https://aka.ms/logicexpressions#base64ToBinary for usage details.'."
Here is the Raw Input to the "When new email arrives" step:
"Attachments": [
{
"#odata.type": "#Microsoft.OutlookServices.FileAttachment",
"Id": "12345",
"LastModifiedDateTime": "2020-11-22T07:38:19+00:00",
"Name": "footer.jpg",
"ContentType": "image/jpeg",
"Size": 13969,
"IsInline": false,
"ContentBytes": null
}
]
Any help with these steps is much appreciated. Thanks!
Please click Add new parameter and choose Include Attachments, then select Yes for Include Attachments.
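In the workflow's code view, that setting surfaces as an includeAttachments query on the trigger. This is a rough, untested sketch only: the exact trigger path and the other query values are assumptions and may differ with your connector version, so check your own code view rather than copying this verbatim.
"When_a_new_email_arrives": {
    "type": "ApiConnection",
    "inputs": {
        "host": {
            "connection": {
                "name": "@parameters('$connections')['office365']['connectionId']"
            }
        },
        "method": "get",
        "path": "/Mail/OnNewEmail",
        "queries": {
            "folderPath": "Inbox",
            "includeAttachments": true
        }
    }
}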

How to convert JSON to a collection in Power Apps

I have a Power App that uses a flow from Power Automate.
My flow does an HTTP GET and responds with JSON to Power Apps, like below.
Here is the JSON as text:
{"value": "[{\"dataAreaId\":\"mv\",\"AccountNum\":\"100000\",\"Name\":\"*****L FOOD AB\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"100001\",\"Name\":\"**** AB\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"100014\",\"Name\":\"****(SEB)\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"100021\",\"Name\":\"**** AB\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"100029\",\"Name\":\"**** AB\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"500100\",\"Name\":\"**** AB\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"500210\",\"Name\":\"****\"}]"}
But when I try to convert this JSON to a collection, it doesn't behave like a list.
It just seems like text. Here is how I try to bind the list.
How can I create a collection from JSON to bind to the gallery view?
I found the solution. I finally created a collection from the response of the flow.
The flow's name is GetVendor.
The response of the flow is like this:
{"value": "[{\"dataAreaId\":\"mv\",\"AccountNum\":\"100000\",\"Name\":\"*****L FOOD AB\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"100001\",\"Name\":\"**** AB\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"100014\",\"Name\":\"****(SEB)\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"100021\",\"Name\":\"**** AB\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"100029\",\"Name\":\"**** AB\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"500100\",\"Name\":\"**** AB\"},{\"dataAreaId\":\"mv\",\"AccountNum\":\"500210\",\"Name\":\"****\"}]"}
The code below creates a list from this response:
ClearCollect(_vendorData, MatchAll(GetVendors.Run(_token.value).value, "\{""dataAreaId"":""(?<dataAreaId>[^""]*)"",""AccountNum"":""(?<AccountNum>[^""]*)"",""Name"":""(?<Name>[^""]*)""\}"));
And I could bind AccountNum and Name from the _vendorData collection to the gallery view.
In my case I had the same issue, but I couldn't manage to get data into the _vendorData collection because the MatchAll regex part was not working correctly, even though I had exactly the same scenario.
My solution was to modify the flow itself to return a Response action instead of Respond to a PowerApp or flow, so that I could return the full response from the HTTP action.
This also caused me some issues, because when I generated the schema from a sample I could not register the flow with the Power App; it failed with the error Failed during http send request.
The solution was to manually review the response schema and change all column types to one of the following three, because others are not supported: string, integer, or boolean. Object and array can be set only on top-level items, never on children, so if you have anything other than those three, replace it with string. And no property can be left with an undefined type.
Basically I like this solution even more, because in Power Apps itself you do not need to do any conversion: simply use the data as is, because an array is already recognized as a collection and all the properties are already named for you.
An example Response step schema is below.
{
    "type": "object",
    "properties": {
        "PropertyOne": {
            "type": "string"
        },
        "PropertyTwo": {
            "type": "integer"
        },
        "PropertyThree": {
            "type": "boolean"
        },
        "PropertyFour": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "PropertyArray1": {
                        "type": "string"
                    },
                    "PropertyArray2": {
                        "type": "integer"
                    },
                    "PropertyArray3": {
                        "type": "boolean"
                    }
                }
            }
        }
    }
}
It is easy now.
Power Apps introduced the ParseJSON function, which makes converting a string to a collection easy.
Table(ParseJSON(JSONString));
In the gallery, map columns like ThisItem.Value.ColumnName.

How to download an image/media using telegram API

I want to start by saying that this question is not about the Telegram bot API. I am trying to fetch images from a channel using the Telegram core API. The image is in the media property of the message object:
"_": "message",
"pFlags": {
"post": true
},
"flags": 17920,
"post": true,
"id": 11210,
"to_id": {
"_": "peerChannel",
"channel_id": 1171605754
},
"date": 1550556770,
"message": "",
"media": {
"_": "messageMediaPhoto",
"pFlags": {},
"flags": 1,
"photo": {
"_": "photo",
"pFlags": {},
"flags": 0,
"id": "6294134956242348146",
"access_hash": "11226369941418527484",
"date": 1550556770,
I am using the upload.getFile API to fetch the file. An example is:
upload.getFile({
    location: {
        _: 'inputFileLocation',
        id: '6294134956242348146',
        access_hash: '11226369941418527484'
    },
    limit: 1000,
    offset: 0
})
But the problem is that it throws the error RpcError: CODE#400 LIMIT_INVALID. From looking at https://core.telegram.org/api/files, it seems the limit value is invalid. I tried giving limit as:
1024000 (1Kb)
20480000 (20Kb)
204800000 (200kb)
But it always returns the same error.
For anyone who is also frustrated with the docs: using, reading, and trying out different stuff will ultimately work for you. If possible, someone should take up the task of documenting this wonderful open source software.
Coming to the answer: the location object shouldn't contain id or access_hash like in other APIs; rather, it has its own parameters as defined in the Telegram schema.
A message has a media property, which contains a sizes object with 3 or more size options (thumbnail, preview, websize, and so on). Choose the one you need and use its volume_id, local_id, and secret properties. The working code will look something like this:
upload.getFile({
    location: {
        _: 'inputFileLocation', // this parameter will change for other file types
        volume_id: volumeId,
        local_id: localId,
        secret: secret
    },
    limit: 1024 * 1024,
    offset: 0
}, {
    isFileTransfer: true,
    createClient: true
})
The following points should be noted.
Limit should be in bytes (not bits)
Offset will be 0. But if it's a big file, use offset and limit together to download parts of the file and join them.
Additional parameters such as isFileTransfer and createClient also exist. I haven't fully understood why they're needed. If I have time, I'll update this later.
Try using a library that's built on top of the original Telegram library. I'm using Airgram, a JS/TS library with a well-maintained repo.

Indexer failed to process blob due to missing content type, but the blob has a content type

I set up blob indexing and full-text searching for Azure as described in this article: Indexing Documents in Azure Blob Storage with Azure Search.
Some of my PDFs, however, fail in the indexer:
[
    {
        "key": null,
        "errorMessage": "Error processing blob 'https://my-storage.blob.core.windows.net/my-container/mydocument.pdf' with content type '': 422"
    }
]
I double-checked the properties on the blob to make sure its content type was set:
{
    "container": "my-container",
    "name": "mydocument.pdf",
    "metadata": {},
    "lastModified": "Fri, 08 Jul 2016 19:43:15 GMT",
    "etag": "0xXXXXXXXXXXXXXXX",
    "blobType": "BlockBlob",
    "contentLength": "3863790",
    "requestId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "contentSettings": {
        "contentType": "application/pdf",
        "contentMD5": "xxxxxxxxxxxxxxxxxxxxxx=="
    },
    "lease": {
        "status": "unlocked",
        "state": "available"
    }
}
Now, this particular PDF has some security restrictions (no printing), so I thought that might affect it. I created some PDFs from scratch to test it out, and they worked just fine, both with and without the restrictions.
There are going to be occasional documents that Azure Search cannot handle, due to security restrictions, corrupted files, etc. There are several knobs to control how such files are handled; please see this answer for details.
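For instance, the blob indexer definition can be configured to tolerate problem documents instead of failing. A minimal sketch (the indexer, data source, and index names are placeholders) using the failOnUnsupportedContentType, failOnUnprocessableDocument, and maxFailedItems settings:
{
    "name": "my-blob-indexer",
    "dataSourceName": "my-blob-datasource",
    "targetIndexName": "my-index",
    "parameters": {
        "maxFailedItems": -1,
        "maxFailedItemsPerBatch": -1,
        "configuration": {
            "failOnUnsupportedContentType": false,
            "failOnUnprocessableDocument": false
        }
    }
}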
