I am new to Google Cloud DLP. I ran a POST to https://dlp.googleapis.com/v2beta1/inspect/operations to scan a .parquet file in a Google Cloud Storage directory, using cloudStorageOptions to save the .csv output.
The .parquet file is 53.93 MB.
When I make the API call on the .parquet file I get:
"processedBytes": "102308122",
"infoTypeStats": [{
"infoType": {
"name": "AMERICAN_BANKERS_CUSIP_ID"
},
"count": "1"
}, {
"infoType": {
"name": "IP_ADDRESS"
},
"count": "17"
}, {
"infoType": {
"name": "US_TOLLFREE_PHONE_NUMBER"
},
"count": "148"
}, {
"infoType": {
"name": "EMAIL_ADDRESS"
},
"count": "30"
}, {
"infoType": {
"name": "US_STATE"
},
"count": "22"
}]
When I convert the .parquet file to .csv I get a 360.58 MB file. Then if I make the API call on the .csv file I get:
"processedBytes": "377530307",
"infoTypeStats": [{
"infoType": {
"name": "CREDIT_CARD_NUMBER"
},
"count": "56546"
}, {
"infoType": {
"name": "EMAIL_ADDRESS"
},
"count": "372527"
}, {
"infoType": {
"name": "NETHERLANDS_BSN_NUMBER"
},
"count": "5"
}, {
"infoType": {
"name": "US_TOLLFREE_PHONE_NUMBER"
},
"count": "1331321"
}, {
"infoType": {
"name": "AUSTRALIA_TAX_FILE_NUMBER"
},
"count": "52269"
}, {
"infoType": {
"name": "PHONE_NUMBER"
},
"count": "28"
}, {
"infoType": {
"name": "US_DRIVERS_LICENSE_NUMBER"
},
"count": "114"
}, {
"infoType": {
"name": "US_STATE"
},
"count": "141383"
}, {
"infoType": {
"name": "KOREA_RRN"
},
"count": "56144"
}],
Clearly, when I scan the .parquet file, not all the infoTypes are detected compared to running the scan on the .csv file, where I verified that all email addresses were detected.
I couldn't find any documentation on columnar or compressed formats such as Parquet, so I am assuming that Google Cloud DLP doesn't offer this capability.
Any help would be greatly appreciated.
Parquet files are currently scanned as binary objects; the system does not yet parse them intelligently. In the V2 API, the supported file types are listed here: https://cloud.google.com/dlp/docs/reference/rpc/google.privacy.dlp.v2#filetype.
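As a workaround, converting the Parquet file to CSV first (as the question already does) and scanning the CSV with the V2 API is the practical route. A minimal sketch of a V2 inspect-job request body, POSTed to https://dlp.googleapis.com/v2/projects/PROJECT_ID/dlpJobs; the bucket path and infoType list are illustrative placeholders, not taken from the question:

```json
{
  "inspectJob": {
    "storageConfig": {
      "cloudStorageOptions": {
        "fileSet": { "url": "gs://my-bucket/exported/*.csv" },
        "fileTypes": ["CSV"]
      }
    },
    "inspectConfig": {
      "infoTypes": [
        { "name": "EMAIL_ADDRESS" },
        { "name": "US_TOLLFREE_PHONE_NUMBER" }
      ]
    }
  }
}
```

Scanning the CSV this way should produce structured (per-cell) findings rather than the binary-object scan the Parquet file gets.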
Related
We use enterprise search indexes to store items that can be tagged by multiple tenants.
e.g.
[
{
"id": 1,
"name": "document 1",
"tags": [
{ "company_id": 1, "tag_id": 1, "tag_name": "bla" },
{ "company_id": 2, "tag_id": 1, "tag_name": "bla" }
]
}
]
I'm looking for a way to retrieve all documents with only the tags of company 1.
This request:
{
"query": "",
"facets": {
"tags": {
"type": "value"
}
},
"sort": {
"created": "desc"
},
"page": {
"size": 20,
"current": 1
}
}
Is coming back with
...
"facets": {
"tags": [
{
"type": "value",
"data": [
{
"value": "{\"company_id\":1,\"tag_id\":1,\"tag_name\":\"bla\"}",
"count": 1
},
{
"value": "{\"company_id\":2,\"tag_id\":1,\"tag_name\":\"bla\"}",
"count": 1
}
]
}
],
}
...
Can I modify the request in a way such that I get no tags with "company_id" = 2?
I have a solution that strips the extra data from the results after they are retrieved, but I'm looking for a better approach.
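For reference, the post-processing workaround mentioned above can be sketched in a few lines of Python. The response shape and the JSON-encoded facet values follow the sample response in the question; `filter_facet_tags` is a hypothetical helper, not part of any search client API:

```python
import json

def filter_facet_tags(response: dict, company_id: int) -> dict:
    """Keep only facet tag values belonging to the given company.

    Each facet value is a JSON-encoded tag object like
    '{"company_id":1,"tag_id":1,"tag_name":"bla"}'.
    """
    for facet in response.get("facets", {}).get("tags", []):
        facet["data"] = [
            entry for entry in facet.get("data", [])
            if json.loads(entry["value"]).get("company_id") == company_id
        ]
    return response

# Example using the sample response from the question:
resp = {
    "facets": {
        "tags": [{
            "type": "value",
            "data": [
                {"value": '{"company_id":1,"tag_id":1,"tag_name":"bla"}', "count": 1},
                {"value": '{"company_id":2,"tag_id":1,"tag_name":"bla"}', "count": 1},
            ],
        }]
    }
}
filtered = filter_facet_tags(resp, 1)
```

This keeps the server request unchanged and only trims the facet data per tenant before it reaches the UI.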
Hi, I would like to fetch from GraphQL only those records whose date falls in a given month, in this case August.
If I want to pull another month, it should be enough to change it in the query. At the moment, my query returns all months instead of only the one given in the filter.
schema.json
{
"kind": "collectionType",
"collectionName": "product_popularities",
"info": {
"singularName": "product-popularity",
"pluralName": "product-popularities",
"displayName": "Popularity",
"description": ""
},
"options": {
"draftAndPublish": true
},
"pluginOptions": {},
"attributes": {
"podcast": {
"type": "relation",
"relation": "manyToOne",
"target": "api::product.products",
"inversedBy": "products"
},
"value": {
"type": "integer"
},
"date": {
"type": "date"
}
}
}
My query
query {
Popularities(filters: {date: {contains: [2022-08]}}) {
data {
attributes {
date
value
}
}
}
}
Response
{
"data": {
"Popularities": {
"data": [
{
"attributes": {
"date": "2022-08-03",
"value": 50
}
},
{
"attributes": {
"date": "2022-08-04",
"value": 1
}
},
{
"attributes": {
"date": "2022-08-10",
"value": 100
}
},
{
"attributes": {
"date": "2022-07-06",
"value": 20
}
}
]
}
}
}
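In Strapi v4's GraphQL API, date fields are not filtered with contains the way strings are; a common fix is an explicit range filter. A sketch, keeping the Popularities query root from above and assuming the default v4 filter operators (between, or equivalently gte/lte):

```graphql
query {
  Popularities(filters: { date: { between: ["2022-08-01", "2022-08-31"] } }) {
    data {
      attributes {
        date
        value
      }
    }
  }
}
```

Swapping in a different month then only requires changing the two boundary dates.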
Due to a (somewhat) special requirement to store data alongside a FHIR resource, I am looking for the cleanest way to extend it to hold such extra data, which is basically dynamic key-value lists.
On the one hand I have something like the following, with 0..* sets of key-value pairs. I have tried to store it by nesting three extensions (adjusting the context at each level), with the innermost one restricted to the accepted value types (string/integer). This does not allow me to repeat ids, but I am working around that with prefixes.
"n-lists-of-key-value": [
{
"list1-key1": "list1-value1",
"list1-key2": "list1-value1",
"list1-key3": "list1-value1"
},
{
"list2-key1": "list2-value1",
"list2-key2": "list2-value2"
}
]
becomes:
···,
"extension": [ {
"id": "n-lists-of-key-value",
"url": "http://x.org/fhir/Extension/NListOfKeyValueExtension",
"extension": [ {
"id": "list_1",
"url": "http://x.org/fhir/Extension/ListKeyValueExtension,
"extension": [ {
"id": "list1-key1",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list1-value1"
}, {
"id": "list1-key2",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list1-value1"
}, {
"id": "list1-key3",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list1-value1"
} ]
}, {
"id": "list_2",
"url": "http://x.org/fhir/Extension/ListKeyValueExtension,
"extension": [ {
"id": "list2-key1",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list2-value1"
}, {
"id": "list2-key2",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list2-value2"
} ]
} ]
} ],
But on the other hand I have the following, which I find hardest to model: trying to reuse the extension definitions, I run into (conceptual?) difficulties with slicing and the use of discriminators, and cannot get past the error "Element matches more than one slice".
"my-3-fixed-named-sets": {
"fixed-named-set1": {
"list1-key1": "list1-value1",
"list1-key2": "list1-value2",
"list1-key3": "list1-value3"
},
"fixed-named-set2": {
"list2-key1": "list2-value1",
"list2-key2": "list2-value2"
},
"fixed-named-set3": {
"list2-key1": "list2-value1",
"list2-key2": "list2-value2"
}
}
becomes:
···,
"extension": [ {
"id": "my-3-fixed-named-sets",
"url": "http://x.org/fhir/Extension/NListOfKeyValueExtension",
"extension": [ {
"id": "fixed-named-set1",
"url": "http://x.org/fhir/Extension/ListKeyValueExtension,
"extension": [ {
"id": "list1-key1",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list1-value1"
}, {
"id": "list1-key2",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list1-value2"
}, {
"id": "list1-key3",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list1-value3"
} ]
}, {
"id": "fixed-named-set2",
"url": "http://x.org/fhir/Extension/ListKeyValueExtension,
"extension": [ {
"id": "list2-key1",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list2-value1"
}, {
"id": "list2-key2",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list2-value2"
} ],
}, {
"id": "fixed-named-set3",
"url": "http://x.org/fhir/Extension/ListKeyValueExtension,
"extension": [ {
"id": "list2-key1",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list2-value1"
}, {
"id": "list2-key3",
"url": "http://x.org/fhir/Extension/KeyValueExtension,
"valueString": "list2-value3"
} ]
} ]
} ],
I am aware that by indiscriminately adding new extensions with different names/urls I can eventually ingest all the data, but I suspect there is a much cleaner way and that I am failing at, probably, how to include an extra condition in the discriminator (https://hl7.org/fhir/profiling.html#slicing) so that I can point at the proper slice, or at the idea of extending extensions or slicing value[x].
"discriminator": [
{
"type": "value",
"path": "url"
},
{ "???": "??Extension.value.slideName??
···
So my specific question would be: can I slice the value[x] of the extension of a specific profile, and in turn make those slices extensions themselves? This last part is to get n key-value lists.
Any comment is very welcome :)
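For what it's worth, the usual pattern for slicing sibling sub-extensions discriminates on url alone, with each slice then fixing its url to a distinct value. A minimal sketch of the slicing part of a differential element; the element id and path are illustrative:

```json
{
  "id": "Extension.extension",
  "path": "Extension.extension",
  "slicing": {
    "discriminator": [
      { "type": "value", "path": "url" }
    ],
    "rules": "open"
  }
}
```

Since all three named sets above reuse the same ListKeyValueExtension url, a url discriminator alone cannot tell them apart, which would be consistent with the "Element matches more than one slice" error; giving each fixed set its own nested-extension url avoids the ambiguity.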
I found an interesting article that used several data models in Vega-Lite. Tabular data were combined by key, as in relational databases.
{
"$schema": "https://vega.github.io/schema/vega-lite/v2.json",
"title": "Test",
"datasets": {
"stores": [
{"cmdb_id1": 1, "group": "type1"},
{"cmdb_id1": 2, "group": "type1"},
{"cmdb_id1": 3, "group": "type2"}
],
"labelsTimelines": [
{"cmdb_id2": 1, "value": 83},
{"cmdb_id2": 2, "value": 53},
{"cmdb_id2": 3, "value": 23}
]
},
"data": {"name": "stores"},
"transform": [
{
"lookup": "cmdb_id1",
"from": {
"data": {"name": "labelsTimelines"},
"key": "cmdb_id2",
"fields": ["value"]
}
}
],
"mark": "bar",
"encoding": {
"y": {"aggregate": "sum", "field": "value", "type": "quantitative"},
"x": {"field": "group", "type": "ordinal"}
}
}
Vega Editor
The question arose as to whether it is possible to obtain the same result using the construction:
"data": {"url": "...."}
I changed the source to an Elasticsearch query:
{
"$schema": "https://vega.github.io/schema/vega-lite/v3.json",
"datasets": {
"stores": [{
"url": {
"%context%": "true",
"index": "test_cmdb",
"body": {
"size": 1000,
"_source": ["cmdb_id", "street", "group"]
}
},
"format": {"property": "hits.hits"}
}]},
"data": {
"name": "stores"
},
"encoding": {
"x": {"field": "url.body.size", "type": "ordinal", "title": "X"},
"y": {"field": "url.body.size", "type": "ordinal", "title": "Y"}
},
"layer": [
{
"mark": "rect",
"encoding": {
"tooltip": [
{"field": "url"}]
}
}
]
}
I understand that there may be a syntactical error; in any case, the data did not come from Elasticsearch.
Thanks in advance!
No, it is not currently possible to specify URL data within top-level "datasets". The relevant open feature request in Vega-Lite is here: https://github.com/vega/vega-lite/issues/4004.
You're much better off using Vega rather than Vega-Lite for this. In Vega you can specify as many datasets as you like, each with its own URL. For example...
...
"data": [
  {
    "name": "dataset_1",
    "url": {
      ...
    }
  },
  {
    "name": "dataset_2",
    "url": {
      ...
    }
  }
]
...
This can actually get very interesting since it means you can combine data from multiple indices into one visualisation.
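A slightly fuller sketch of that idea, joining two Elasticsearch-backed datasets with Vega's lookup transform; the index names, field paths, and the Kibana-style "%context%" url are illustrative assumptions:

```json
{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "data": [
    {
      "name": "stores",
      "url": {
        "%context%": true,
        "index": "test_cmdb",
        "body": {"size": 1000, "_source": ["cmdb_id", "group"]}
      },
      "format": {"type": "json", "property": "hits.hits"}
    },
    {
      "name": "labels",
      "url": {
        "%context%": true,
        "index": "test_labels",
        "body": {"size": 1000, "_source": ["cmdb_id", "value"]}
      },
      "format": {"type": "json", "property": "hits.hits"},
      "transform": [
        {
          "type": "lookup",
          "from": "stores",
          "key": "_source.cmdb_id",
          "fields": ["_source.cmdb_id"],
          "as": ["store"]
        }
      ]
    }
  ]
}
```

Each row of "labels" then carries the matching "stores" document under "store", which marks can reference in their encodings.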
I know this is late, but figured this might help people who are looking around.
I am trying to convert a JSON file into CSV but I don't seem to have any luck doing so. My JSON looks something like this:
...
{
{"meta": {
"contentType": "Response"
},
"content": {
"data": {
"_type": "ObjectList",
"erpDataObjects": [
{
"meta": {
"lastModified": "2020-08-10T08:37:21.000+0000",
},
"head": {
"fields": {
"number": {
"value": "1",
},
"id": {
"value": "10000"
},
}
}
{
"meta": {
"lastModified": "2020-08-10T08:37:21.000+0000",
},
"head": {
"fields": {
"number": {
"value": "2",
},
"id": {
"value": "10001"
},
}
}
{
"meta": {
"lastModified": "2020-08-10T08:37:21.000+0000",
},
"head": {
.. much more data
I basically want my csv to look like this:
number,id
1,10000
2,10001
My flow looks like this:
GetFile -> Set the output-file name -> ConvertRecord -> UpdateAttribute -> PutFile
ConvertRecord uses the JsonTreeReader and a CSVRecordSetWriter
Both the reader and the writer reference an AvroSchemaRegistry.
The AvroSchema itself looks like this:
{
"type": "record",
"name": "head",
"fields":
[
{"name": "number", "type": ["string"]},
{"name": "id", "type": ["string"]}
]
}
But I only get this output:
number,id
,
Which makes sense, because I'm not specifically indicating where those values are located. I tried the JsonPathReader instead before, but that only gave me one record. I'm not really sure how I can configure either of the two to output exactly what I want. Help would be much appreciated!
Using ConvertRecord for JSON -> CSV is mostly intended for "flat" JSON files where each field in the object becomes a column in the outgoing CSV file. For nested/complex structures, consider JoltTransformRecord (or JoltTransformJSON); it allows you to do more complex transformations. Your example doesn't appear to be valid JSON as-is, but assuming you have something like this as input:
{
"meta": {
"contentType": "Response"
},
"content": {
"data": {
"_type": "ObjectList",
"erpDataObjects": [
{
"meta": {
"lastModified": "2020-08-10T08:37:21.000+0000"
},
"head": {
"fields": {
"number": {
"value": "1"
},
"id": {
"value": "10000"
}
}
}
},
{
"meta": {
"lastModified": "2020-08-10T08:37:21.000+0000"
},
"head": {
"fields": {
"number": {
"value": "2"
},
"id": {
"value": "10001"
}
}
}
}
]
}
}
}
The following JOLT spec should give you what you want for output:
[
{
"operation": "shift",
"spec": {
"content": {
"data": {
"erpDataObjects": {
"*": {
"head": {
"fields": {
"number": {
"value": "[&4].number"
},
"id": {
"value": "[&4].id"
}
}
}
}
}
}
}
}
}
]
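Outside NiFi, the same flattening can be sketched in plain Python with only the standard library; the input shape follows the sample input above, trimmed to two records:

```python
import csv
import io
import json

SAMPLE = """
{
  "meta": {"contentType": "Response"},
  "content": {
    "data": {
      "_type": "ObjectList",
      "erpDataObjects": [
        {"meta": {"lastModified": "2020-08-10T08:37:21.000+0000"},
         "head": {"fields": {"number": {"value": "1"}, "id": {"value": "10000"}}}},
        {"meta": {"lastModified": "2020-08-10T08:37:21.000+0000"},
         "head": {"fields": {"number": {"value": "2"}, "id": {"value": "10001"}}}}
      ]
    }
  }
}
"""

def erp_json_to_csv(text: str) -> str:
    """Flatten content.data.erpDataObjects[*].head.fields into number,id rows."""
    objects = json.loads(text)["content"]["data"]["erpDataObjects"]
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["number", "id"])
    for obj in objects:
        fields = obj["head"]["fields"]
        writer.writerow([fields["number"]["value"], fields["id"]["value"]])
    return out.getvalue()

result = erp_json_to_csv(SAMPLE)
```

This mirrors what the JOLT shift spec does: walk the array, pull out the two nested "value" fields, and emit one row per element.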