Google Cloud DLP - CSV inspection - ruby

I'm trying to inspect a CSV file and there are no findings being returned (I'm using the EMAIL_ADDRESS info type and the addresses I'm using are coming up with positive hits here: https://cloud.google.com/dlp/demo/#!/). I'm sending the CSV file into inspect_content with a byte_item as follows:
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
In looking at the supported file types, it looks like CSV/TSV files are inspected via Structured Parsing.
For CSV/TSV does that mean one can't just sent in the file, and needs to use the table attribute instead of byte_item as per https://cloud.google.com/dlp/docs/inspecting-structured-text?
What about for XSLX files for example? They're an unspecified file type so I tried with a configuration like so, but it still returned no findings:
byte_item: {
type: :BYTES_TYPE_UNSPECIFIED,
data: File.open('/xxxxx/dlptest.xlsx', 'rb').read
}
I'm able to do inspection and redaction with images and text fine, but having a bit of a problem with other file types. Any ideas/suggestions welcome! Thanks!
Edit: The contents of the CSV in question:
$ cat ~/Downloads/dlptest.csv
dylans#gmail.com,anotehu,steve#example.com
blah blah,anoteuh,
aonteuh,
$ file ~/Downloads/dlptest.csv
~/Downloads/dlptest.csv: ASCII text, with CRLF line terminators
The full request:
parent = "projects/xxxxxxxx/global"
inspect_config = {
info_types: [{name: "EMAIL_ADDRESS"}],
min_likelihood: :POSSIBLE,
limits: { max_findings_per_request: 0 },
include_quote: true
}
request = {
parent: parent,
inspect_config: inspect_config,
item: {
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
}
}
dlp = Google::Cloud::Dlp.dlp_service
response = dlp.inspect_content(request)

The CSV file I was testing with was something I created using Google Sheets and exported as a CSV, however, the file showed locally as a "text/plain; charset=us-ascii". I downloaded a CSV off the internet and it had a mime of "text/csv; charset=utf-8". This is the one that worked. So it looks like my issue was specifically due the file being an incorrect mime type.

xlsx is not yet supported. Coming soon. (Maybe that part of the question should be split out from the CSV debugging issue.)

Related

Specifying parameters in yml file for Quarto

I am creating a quarto book project in RStudio to render an html document.
I need to specify some parameters in the yml file but the qmd file returns
"object 'params' not found". Using knitR.
I use the default yml file where I have added params under the book tag
project:
type: book
book:
title: "Params_TEst"
author: "Jane Doe"
date: "15/07/2022"
params:
pcn: 0.1
chapters:
- index.qmd
- intro.qmd
- summary.qmd
- references.qmd
bibliography: references.bib
format:
html:
theme: cosmo
pdf:
documentclass: scrreprt
editor: visual
and the qmd file looks like this
# Preface {.unnumbered}
This is a Quarto book.
To learn more about Quarto books visit <https://quarto.org/docs/books>.
```{r}
1 + 1
params$pcn
When I render the book, or preview the book in Rstudio the error I receive is:
Quitting from lines 8-10 (index.qmd)
Error in eval(expr, envir, enclos) : object 'params' not found
Calls: .main ... withVisible -> eval_with_user_handlers -> eval -> eval
I have experimented placing the params line in the yml in different places but nothing works so far.
Could anybody help?
For multi-page renders, e.g. quarto books, you need to add the YAML to each page, not in the _quarto.yml file
So in your case, each of the chapters that calls a parameter needs a YAML header, like index.qmd, intro.qmd, and summary.qmd, but perhaps not references.qmd.
The YAML header should look just like it does in a standard Rmd. So for example, your index.qmd would look like this:
---
params:
pcn: 0.1
---
# Preface {.unnumbered}
This is a Quarto book.
To learn more about Quarto books visit <https://quarto.org/docs/books>.
```{r}
1 + 1
params$pcn
But, what if you need to change the parameter and re-render?
Then simply pass new parameters to the quarto_render function
quarto::quarto_render(input = here::here("quarto"), #expecting a dir to render
output_format = "html", #output dir is set in _quarto.yml
cache_refresh = TRUE,
execute_params = list(pcn = 0.2))
For now, this only seems to work if you add the parameters to each individual page front-matter YAML.
If you have a large number of pages and need to keep parameters centralized, a workaround is to run a preprocessing script that replaces the parameters in all pages. To add a preprocessing script, add the key pre-render to your _quarto.yml file. The Quarto website has detailed instructions.
For example, if you have N pages named index<N>.qmd, you could have a placeholder in the YML of each page:
---
title: This is chapter N
yourparamplaceholder
---
Your pre-render script could replace yourparamplaceholder with the desired parameters. Here's an example Python script:
for filename in os.listdir(dir):
if filename.endswith(".qmd"):
with open(filename, "r") as f:
txt = f.read()
f.replace('yourparamplaceholder', 'params:\n\tpcn: 0.1\n\tother:20\n')
with open(filename, "w") as ff:
ff.write(txt)
I agree with you that being able to set parameters centrally would be a good idea.

Using terraform yamldecode to access multi level element

I have a yaml file (also used in a azure devops pipeline so needs to be in this format) which contains some settings I'd like to directly access from my terraform module.
The file looks something like:
variables:
- name: tenantsList
value: tenanta,tenantb
- name: unitName
value: canary
I'd like to have a module like this to access the settings but I can't see how to get to the bottom level:
locals {
settings = yamldecode(file("../settings.yml"))
}
module "infra" {
source = "../../../infra/terraform/"
unitname = local.settings.variables.unitName
}
But the terraform plan errors with this:
Error: Unsupported attribute
on canary.tf line 16, in module "infra":
16: unitname = local.settings.variables.unitName
|----------------
| local.settings.variables is tuple with 2 elements
This value does not have any attributes.
It seems like the main reason this is difficult is because this YAML file is representing what is logically a single map but is physically represented as a YAML list of maps.
When reading data from a separate file like this, I like to write an explicit expression to normalize it and optionally transform it for more convenient use in the rest of the Terraform module. In this case, it seems like having variables as a map would be the most useful representation as a Terraform value, so we can write a transformation expression like this:
locals {
raw_settings = yamldecode(file("${path.module}/../settings.yml"))
settings = {
variables = tomap({
for v in local.raw_settings.variables : v.name => v.value
})
}
}
The above uses a for expression to project the list of maps into a single map using the name values as the keys.
With the list of maps converted to a single map, you can then access it the way you originally tried:
module "infra" {
source = "../../../infra/terraform/"
unitname = local.settings.variables.unitName
}
If you were to output the transformed value of local.settings as YAML, it would look something like this, which is why accessing the map elements directly is now possible:
variables:
tenantsList: tenanta,tenantb
unitName: canary
This will work only if all of the name strings in your input are unique, because otherwise there would not be a unique map key for each element.
(Writing a normalization expression like this also doubles as some implicit validation for the shape of that YAML file: if variables were not a list or if the values were not all of the same type then Terraform would raise a type error evaluating that expression. Even if no transformation is required, I like to write out this sort of expression anyway because it serves as some documentation for what shape the YAML file is expected to have, rather than having to study all of the references to it throughout the rest of the configuration.)
With my multidecoder for YAML and JSON you are able to access multiple YAML and/or JSON files with their relative paths in one step.
Documentations can be found here:
Terraform Registry -
https://registry.terraform.io/modules/levmel/yaml_json/multidecoder/latest?tab=inputs
GitHub:
https://github.com/levmel/terraform-multidecoder-yaml_json
Usage
Place this module in the location where you need to access multiple different YAML and/or JSON files (different paths possible) and pass
your path/-s in the parameter filepaths which takes a set of strings of the relative paths of YAML and/or JSON files as an argument. You can change the module name if you want!
module "yaml_json_decoder" {
source = "levmel/yaml_json/multidecoder"
version = "0.2.1"
filepaths = ["routes/nsg_rules.yml", "failover/cosmosdb.json", "network/private_endpoints/*.yaml", "network/private_links/config_file.yml", "network/private_endpoints/*.yml", "pipeline/config/*.json"]
}
Patterns to access YAML and/or JSON files from relative paths:
To be able to access all YAML and/or JSON files in a folder entern your path as follows "folder/rest_of_folders/*.yaml", "folder/rest_of_folders/*.yml" or "folder/rest_of_folders/*.json".
To be able to access a specific YAML and/or a JSON file in a folder structure use this "folder/rest_of_folders/name_of_yaml.yaml", "folder/rest_of_folders/name_of_yaml.yml" or "folder/rest_of_folders/name_of_yaml.json"
If you like to select all YAML and/or JSON files within a folder, then you should use "*.yml", "*.yaml", "*.json" format notation. (see above in the USAGE section)
YAML delimiter support is available from version 0.1.0!
WARNING: Only the relative path must be specified. The path.root (it is included in the module by default) should not be passed, but everything after it.
Access YAML and JSON entries
Now you can access all entries within all the YAML and/or JSON files you've selected like that: "module.yaml_json_decoder.files.[name of your YAML or JSON file].entry". If the name of your YAML or JSON file is "name_of_your_config_file" then access it as follows "module.yaml_json_decoder.files.name_of_your_config_file.entry".
Example of multi YAML and JSON file accesses from different paths (directories)
first YAML file:
routes/nsg_rules.yml
rdp:
name: rdp
priority: 80
direction: Inbound
access: Allow
protocol: Tcp
source_port_range: "*"
destination_port_range: 3399
source_address_prefix: VirtualNetwork
destination_address_prefix: "*"
---
ssh:
name: ssh
priority: 70
direction: Inbound
access: Allow
protocol: Tcp
source_port_range: "*"
destination_port_range: 24
source_address_prefix: VirtualNetwork
destination_address_prefix: "*"
second YAML file:
services/logging/monitoring.yml
application_insights:
application_type: other
retention_in_days: 30
daily_data_cap_in_gb: 20
daily_data_cap_notifications_disabled: true
logs:
# Optional fields
- "AppMetrics"
- "AppAvailabilityResults"
- "AppEvents"
- "AppDependencies"
- "AppBrowserTimings"
- "AppExceptions"
- "AppExceptions"
- "AppPerformanceCounters"
- "AppRequests"
- "AppSystemEvents"
- "AppTraces"
first JSON file:
test/config/json_history.json
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
}
main.tf
module "yaml_json_multidecoder" {
source = "levmel/yaml_json/multidecoder"
version = "0.2.1"
filepaths = ["routes/nsg_rules.yml", "services/logging/monitoring.yml", test/config/*.json]
}
output "nsg_rules_entry" {
value = module.yaml_json_multidecoder.files.nsg_rules.aks.ssh.source_address_prefix
}
output "application_insights_entry" {
value = module.yaml_json_multidecoder.files.monitoring.application_insights.daily_data_cap_in_gb
}
output "json_history" {
value = module.yaml_json_multidecoder.files.json_history.glossary.title
}
Changes to Outputs:
nsg_rules_entry = "VirtualNetwork"
application_insights_entry = 20
json_history = "example glossary"

Put text on transfer.sh/file.io from ruby

I'm trying to upload text (not a file) to transfer.sh (edit: and file.io/0x0.st) from ruby. All services only give curl examples but not anything for programming languages. I tried
puts HTTParty.post(
'https://transfer.sh/',
multipart: true,
body: { file: 'some text' }
).body.inspect
but that just gives me an empty string as result, not the url where I should find the upload.

How to access JSON from external data source in Terraform?

I am receiving JSON from a http terraform data source
data "http" "example" {
url = "${var.cloudwatch_endpoint}/api/v0/components"
# Optional request headers
request_headers {
"Accept" = "application/json"
"X-Api-Key" = "${var.api_key}"
}
}
It outputs the following.
http = [{"componentID":"k8QEbeuHdDnU","name":"Jenkins","description":"","status":"Partial Outage","order":1553796836},{"componentID":"ui","name":"ui","description":"","status":"Operational","order":1554483781},{"componentID":"auth","name":"auth","description":"","status":"Operational","order":1554483781},{"componentID":"elig","name":"elig","description":"","status":"Operational","order":1554483781},{"componentID":"kong","name":"kong","description":"","status":"Operational","order":1554483781}]
which is a string in terraform. In order to convert this string into JSON I pass it to an external data source which is a simple ruby function. Here is the terraform to pass it.
data "external" "component_ids" {
program = ["ruby", "./fetchComponent.rb",]
query = {
data = "${data.http.example.body}"
}
}
Here is the ruby function
#!/usr/bin/env ruby
require 'json'
data = JSON.parse(STDIN.read)
results = data.to_json
STDOUT.write results
All of this works. The external data outputs the following (It appears the same as the http output) but according to terraform docs this should be a map
external1 = {
data = [{"componentID":"k8QEbeuHdDnU","name":"Jenkins","description":"","status":"Partial Outage","order":1553796836},{"componentID":"ui","name":"ui","description":"","status":"Operational","order":1554483781},{"componentID":"auth","name":"auth","description":"","status":"Operational","order":1554483781},{"componentID":"elig","name":"elig","description":"","status":"Operational","order":1554483781},{"componentID":"kong","name":"kong","description":"","status":"Operational","order":1554483781}]
}
I was expecting that I could now access data inside of the external data source. I am unable.
Ultimately what I want to do is create a list of the componentID variables which are located within the external data source.
Some things I have tried
* output.external: key "0" does not exist in map data.external.component_ids.result in:
${data.external.component_ids.result[0]}
* output.external: At column 3, line 1: element: argument 1 should be type list, got type string in:
${element(data.external.component_ids.result["componentID"],0)}
* output.external: key "componentID" does not exist in map data.external.component_ids.result in:
${data.external.component_ids.result["componentID"]}
ternal: lookup: lookup failed to find 'componentID' in:
${lookup(data.external.component_ids.*.result[0], "componentID")}
I appreciate the help.
can't test with the variable cloudwatch_endpoint, so I have to think about the solution.
Terraform can't decode json directly before 0.11.x. But there is a workaround to work on nested lists.
Your ruby need be adjusted to make output as variable http below, then you should be fine to get what you need.
$ cat main.tf
variable "http" {
type = "list"
default = [{componentID = "k8QEbeuHdDnU", name = "Jenkins"}]
}
output "http" {
value = "${lookup(var.http[0], "componentID")}"
}
$ terraform apply
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
Outputs:
http = k8QEbeuHdDnU

Vision API: How to get JSON-output

I'm having trouble saving the output given by the Google Vision API. I'm using Python and testing with a demo image. I get the following error:
TypeError: [mid:...] + is not JSON serializable
Code that I executed:
import io
import os
import json
# Imports the Google Cloud client library
from google.cloud import vision
from google.cloud.vision import types
# Instantiates a client
vision_client = vision.ImageAnnotatorClient()
# The name of the image file to annotate
file_name = os.path.join(
os.path.dirname(__file__),
'demo-image.jpg') # Your image path from current directory
# Loads the image into memory
with io.open(file_name, 'rb') as image_file:
content = image_file.read()
image = types.Image(content=content)
# Performs label detection on the image file
response = vision_client.label_detection(image=image)
labels = response.label_annotations
print('Labels:')
for label in labels:
print(label.description, label.score, label.mid)
with open('labels.json', 'w') as fp:
json.dump(labels, fp)
the output appears on the screen, however I do not know exactly how I can save it. Anyone have any suggestions?
FYI to anyone seeing this in the future, google-cloud-vision 2.0.0 has switched to using proto-plus which uses different serialization/deserialization code. A possible error you can get if upgrading to 2.0.0 without changing the code is:
object has no attribute 'DESCRIPTOR'
Using google-cloud-vision 2.0.0, protobuf 3.13.0, here is an example of how to serialize and de-serialize (example includes json and protobuf)
import io, json
from google.cloud import vision_v1
from google.cloud.vision_v1 import AnnotateImageResponse
with io.open('000048.jpg', 'rb') as image_file:
content = image_file.read()
image = vision_v1.Image(content=content)
client = vision_v1.ImageAnnotatorClient()
response = client.document_text_detection(image=image)
# serialize / deserialize proto (binary)
serialized_proto_plus = AnnotateImageResponse.serialize(response)
response = AnnotateImageResponse.deserialize(serialized_proto_plus)
print(response.full_text_annotation.text)
# serialize / deserialize json
response_json = AnnotateImageResponse.to_json(response)
response = json.loads(response_json)
print(response['fullTextAnnotation']['text'])
Note 1: proto-plus doesn't support converting to snake_case names, which is supported in protobuf with preserving_proto_field_name=True. So currently there is no way around the field names being converted from response['full_text_annotation'] to response['fullTextAnnotation']
There is an open closed feature request for this: googleapis/proto-plus-python#109
Note 2: The google vision api doesn't return an x coordinate if x=0. If x doesn't exist, the protobuf will default x=0. In python vision 1.0.0 using MessageToJson(), these x values weren't included in the json, but now with python vision 2.0.0 and .To_Json() these values are included as x:0
Maybe you were already able to find a solution to your issue (if that is the case, I invite you to share it as an answer to your own post too), but in any case, let me share some notes that may be useful for other users with a similar issue:
As you can check using the the type() function in Python, response is an object of google.cloud.vision_v1.types.AnnotateImageResponse type, while labels[i] is an object of google.cloud.vision_v1.types.EntityAnnotation type. None of them seem to have any out-of-the-box implementation to transform them to JSON, as you are trying to do, so I believe the easiest way to transform each of the EntityAnnotation in labels would be to turn them into Python dictionaries, then group them all into an array, and transform this into a JSON.
To do so, I have added some simple lines of code to your snippet:
[...]
label_dicts = [] # Array that will contain all the EntityAnnotation dictionaries
print('Labels:')
for label in labels:
# Write each label (EntityAnnotation) into a dictionary
dict = {'description': label.description, 'score': label.score, 'mid': label.mid}
# Populate the array
label_dicts.append(dict)
with open('labels.json', 'w') as fp:
json.dump(label_dicts, fp)
There is a library released by Google
from google.protobuf.json_format import MessageToJson
webdetect = vision_client.web_detection(blob_source)
jsonObj = MessageToJson(webdetect)
I was able to save the output with the following function:
# Save output as JSON
def store_json(json_input):
with open(json_file_name, 'a') as f:
f.write(json_input + '\n')
And as #dsesto mentioned, I had to define a dictionary. In this dictionary I have defined what types of information I would like to save in my output.
with open(photo_file, 'rb') as image:
image_content = base64.b64encode(image.read())
service_request = service.images().annotate(
body={
'requests': [{
'image': {
'content': image_content
},
'features': [{
'type': 'LABEL_DETECTION',
'maxResults': 20,
},
{
'type': 'TEXT_DETECTION',
'maxResults': 20,
},
{
'type': 'WEB_DETECTION',
'maxResults': 20,
}]
}]
})
The objects in the current Vision library lack serialization functions (although this is a good idea).
It is worth noting that they are about to release a substantially different library for Vision (it is on master of vision's repo now, although not released to PyPI yet) where this will be possible. Note that it is a backwards-incompatible upgrade, so there will be some (hopefully not too much) conversion effort.
That library returns plain protobuf objects, which can be serialized to JSON using:
from google.protobuf.json_format import MessageToJson
serialized = MessageToJson(original)
You can also use something like protobuf3-to-dict

Resources