I'm trying to run a Data Pipeline inside a CloudFormation stack. This stack references the exports of another stack that contains a Redshift cluster. When I run it, I get an error stating "'Ec2Instance', errors = Internal error during validation of this object.", but I am unable to find any more information on this error or what it means. Other objects show the same error, but they depend on Ec2Instance, so I assume the errors cascade from it.
Here is my PipelineObject for the Ec2Instance:
{
"Id": "Ec2Instance",
"Name": "Ec2Instance",
"Fields": [
{
"Key": "type",
"StringValue": "Ec2Resource"
},
{
"Key": "role",
"StringValue": "DataPipelineDefaultRole"
},
{
"Key": "resourceRole",
"StringValue": "DataPipelineDefaultResourceRole"
},
{
"Key": "terminateAfter",
"StringValue": "1 Hour"
},
{
"Key": "actionOnTaskFailure",
"StringValue": "terminate"
},
{
"Key": "actionOnResourceFailure",
"StringValue": "retryAll"
},
{
"Key": "maximumRetries",
"StringValue": "1"
},
{
"Key": "instanceType",
"StringValue": "m1.medium"
},
{
"Key": "securityGroupIds",
"StringValue": "#{myRDSRedshiftSecurityGrps}"
},
{
"Key": "subnetId",
"StringValue": "#{myRedshiftClusterSubnetGroup}"
}
]
}
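For context, the #{...} values above come from the other stack's exports, wired in through the pipeline's ParameterValues, roughly like the fragment below. This is only an illustrative sketch: the resource name, export names, and parameter types are placeholders, and PipelineObjects (including the Ec2Instance object above) are omitted for brevity.
"RedshiftCopyPipeline": {
  "Type": "AWS::DataPipeline::Pipeline",
  "Properties": {
    "Name": "redshift-copy-pipeline",
    "ParameterObjects": [
      {
        "Id": "myRDSRedshiftSecurityGrps",
        "Attributes": [ { "Key": "type", "StringValue": "String" } ]
      },
      {
        "Id": "myRedshiftClusterSubnetGroup",
        "Attributes": [ { "Key": "type", "StringValue": "String" } ]
      }
    ],
    "ParameterValues": [
      {
        "Id": "myRDSRedshiftSecurityGrps",
        "StringValue": { "Fn::ImportValue": "redshift-stack-SecurityGroupId" }
      },
      {
        "Id": "myRedshiftClusterSubnetGroup",
        "StringValue": { "Fn::ImportValue": "redshift-stack-SubnetId" }
      }
    ]
  }
}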
I am new to Google Cloud DLP. I ran a POST to https://dlp.googleapis.com/v2beta1/inspect/operations to scan a .parquet file in a Google Cloud Storage directory, using cloudStorageOptions to save the .csv output.
The .parquet file is 53.93 MB.
When I make the API call on the .parquet file I get:
"processedBytes": "102308122",
"infoTypeStats": [{
"infoType": {
"name": "AMERICAN_BANKERS_CUSIP_ID"
},
"count": "1"
}, {
"infoType": {
"name": "IP_ADDRESS"
},
"count": "17"
}, {
"infoType": {
"name": "US_TOLLFREE_PHONE_NUMBER"
},
"count": "148"
}, {
"infoType": {
"name": "EMAIL_ADDRESS"
},
"count": "30"
}, {
"infoType": {
"name": "US_STATE"
},
"count": "22"
}]
When I convert the .parquet file to .csv I get a 360.58 MB file. Then if I make the API call on the .csv file I get:
"processedBytes": "377530307",
"infoTypeStats": [{
"infoType": {
"name": "CREDIT_CARD_NUMBER"
},
"count": "56546"
}, {
"infoType": {
"name": "EMAIL_ADDRESS"
},
"count": "372527"
}, {
"infoType": {
"name": "NETHERLANDS_BSN_NUMBER"
},
"count": "5"
}, {
"infoType": {
"name": "US_TOLLFREE_PHONE_NUMBER"
},
"count": "1331321"
}, {
"infoType": {
"name": "AUSTRALIA_TAX_FILE_NUMBER"
},
"count": "52269"
}, {
"infoType": {
"name": "PHONE_NUMBER"
},
"count": "28"
}, {
"infoType": {
"name": "US_DRIVERS_LICENSE_NUMBER"
},
"count": "114"
}, {
"infoType": {
"name": "US_STATE"
},
"count": "141383"
}, {
"infoType": {
"name": "KOREA_RRN"
},
"count": "56144"
}],
Obviously, when I scan the .parquet file, not all the infoTypes are detected compared to scanning the .csv file, where I verified that all email addresses were detected.
I couldn't find any documentation on compressed formats such as Parquet, so I am assuming that Google Cloud DLP doesn't offer this capability.
Any help would be greatly appreciated.
Parquet files are currently scanned as binary objects, as the system does not yet parse them intelligently. The file types supported by the v2 API are listed here: https://cloud.google.com/dlp/docs/reference/rpc/google.privacy.dlp.v2#filetype
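If it helps, in the v2 API you can also restrict which file types a job scans through cloudStorageOptions. Below is a rough, untested sketch of a dlpJobs request that limits the scan to text files and saves findings to BigQuery; the project, bucket, and table names are placeholders, so double-check the exact field names against the reference linked above.
POST https://dlp.googleapis.com/v2/projects/my-project/dlpJobs
{
  "inspectJob": {
    "storageConfig": {
      "cloudStorageOptions": {
        "fileSet": { "url": "gs://my-bucket/data/*" },
        "fileTypes": ["TEXT_FILE"]
      }
    },
    "inspectConfig": {
      "infoTypes": [
        { "name": "EMAIL_ADDRESS" },
        { "name": "US_TOLLFREE_PHONE_NUMBER" }
      ]
    },
    "actions": [
      {
        "saveFindings": {
          "outputConfig": {
            "table": {
              "projectId": "my-project",
              "datasetId": "dlp_output",
              "tableId": "findings"
            }
          }
        }
      }
    ]
  }
}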
I'm using Elasticsearch to build a search filter, and I need to find all the distinct values stored in the "cambio" field, without repeating values.
The values are saved as strings such as "Manual de 5 marchas" or "Manual de 6 marchas", and so on.
I created this query to return all saved values:
GET /crawler10/crawler-vehicles10/_search
{
"size": 0,
"aggregations": {
"my_agg": {
"terms": {
"field": "cambio"
}
}
}
}
But when I run it, the returned values look like this:
"aggregations": {
"my_agg": {
"doc_count_error_upper_bound": 2,
"sum_other_doc_count": 2613,
"buckets": [
{
"key": "de",
"doc_count": 2755
},
{
"key": "marchas",
"doc_count": 2714
},
{
"key": "manual",
"doc_count": 2222
},
{
"key": "modo",
"doc_count": 1097
},
{
"key": "5",
"doc_count": 1071
},
{
"key": "d",
"doc_count": 1002
},
{
"key": "n",
"doc_count": 1002
},
{
"key": "automática",
"doc_count": 935
},
{
"key": "com",
"doc_count": 919
},
{
"key": "6",
"doc_count": 698
}
]
}
}
Aggregations are based on the mapping of the stored field. The cambio field appears to be mapped as analyzed (the default for strings), which is why the aggregation returns individual tokens rather than whole values. You need an index where cambio is mapped as not_analyzed.
You can create such an index with a PUT request as below (if your ES version is less than 5) and then re-index your data into it.
PUT crawler10
{
"mappings": {
"crawler-vehicles10": {
"properties": {
"cambio": {
"type": "string"
"index": "not_analyzed"
}
}
}
}
}
For ES v5 or greater
PUT crawler10
{
"mappings": {
"crawler-vehicles10": {
"properties": {
"cambio": {
"type": "keyword"
}
}
}
}
}
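For the actual re-indexing step, the _reindex API can copy the existing documents across once the corrected index exists. A minimal sketch, assuming the corrected mapping was created under a new index name (crawler10_v2 here is just an example name):
POST _reindex
{
  "source": { "index": "crawler10" },
  "dest": { "index": "crawler10_v2" }
}
After the copy finishes, run the terms aggregation against the new index and it should return whole values such as "Manual de 5 marchas" instead of individual tokens.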
I have Elasticsearch documents for products with tags. Following is the structure of the documents:
{
"id": 'id-1',
"name: "prod-1",
"tags": [
{
"id": 'tag-id-1',
"name": 'tag-name-1'
},
{
"id": 'tag-id-2',
"name": 'tag-name-2'
}
]
}
What I want to do is:
Get all products with the maximum number of overlapping tags (something like related questions on Stack Overflow, assuming questions are stored in Elasticsearch with tags). The output would be something like this:
{
"aggregations": {
"products": [
{
"key": "product-name-1",
"tags": [
{
"key": "tag-name-1",
}
{
"key": "tag-name-2"
}
]
},
{
"key": "product-name-2",
"tags": [
{
"key": "tag-name-2",
}
{
"key": "tag-name-3"
}
]
}
]
}
}
Get, for each tag, the other tags it is grouped together with the maximum number of times. The output would be something like this:
{
"aggregations": {
"tags": [
{
"key": "tag-name-1",
"tags": [
{
"key": "tag-name-2",
}
{
"key": "tag-name-3"
}
]
},
{
"key": "product-name-2",
"tags": [
{
"key": "tag-name-1",
}
{
"key": "tag-name-5"
}
]
}
]
}
}
I am not sure which type of aggregation to use for this. Any help would be appreciated.
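Would something along these lines be a reasonable starting point for the second part? This is only a rough, untested sketch; it assumes tags is mapped as a nested type, tags.name is a not-analyzed/keyword field, and the products index name is made up:
GET /products/_search
{
  "size": 0,
  "aggs": {
    "all_tags": {
      "nested": { "path": "tags" },
      "aggs": {
        "tag_names": {
          "terms": { "field": "tags.name", "size": 50 },
          "aggs": {
            "back_to_products": {
              "reverse_nested": {},
              "aggs": {
                "co_occurring_tags": {
                  "nested": { "path": "tags" },
                  "aggs": {
                    "names": { "terms": { "field": "tags.name", "size": 10 } }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}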
Thanks.
I have some JSON that I have pulled from AWS and formatted with jq (the original JSON and the jq command are at the bottom) to give me the following output:
{
"VolumeId": "vol-11111111",
"Tags": {
"Name": "volume1",
"Finish": "00:00",
"Start": "00:20",
"Period": "2"
}
}
{
"VolumeId": "vol-22222222",
"Tags": {
"Name": "volume2",
"Period": "1",
"Start": "00:00",
"Finish": "00:20"
}
}
{
"VolumeId": "vol-33333333",
"Tags": {
"Period": "1",
"Start": "00:00",
"Name": "volume3",
"Finish": "00:20"
}
}
What I now need to do is pull out the 'VolumeId', 'Period', 'Start' and 'Finish' values. I would like to iterate over these objects and put them into 4 bash variables of the same names in a for loop.
e.g.
VolumeId="vol-33333333"
Period="1"
Start="00:00"
Finish="00:20"
The problem is that if I put the entire JSON into a variable, it is treated as a single argument. I could use something like mapfile, but it would then turn it into too many arguments, e.g.:
}
"Volumes": [
{
etc
Any help in getting this to work would be greatly appreciated. The end result is to be able to take a snapshot of the volume and use the 'Period' tag to calculate retention etc.
--
Original JSON:
{
"Volumes": [
{
"Attachments": [],
"Tags": [
{
"Value": "volume1",
"Key": "Name"
},
{
"Value": "00:00",
"Key": "Start"
},
{
"Value": "00:20",
"Key": "Finish"
},
{
"Value": "2",
"Key": "Period"
}
],
"VolumeId": "vol-11111111"
},
{
"Attachments": [],
"Tags": [
{
"Value": "volume2",
"Key": "Name"
},
{
"Value": "00:00",
"Key": "Start"
},
{
"Value": "00:20",
"Key": "Finish"
},
{
"Value": "2",
"Key": "Period"
}
],
"VolumeId": "vol-22222222"
},
{
"Attachments": [],
"Tags": [
{
"Value": "volume3",
"Key": "Name"
},
{
"Value": "00:00",
"Key": "Start"
},
{
"Value": "00:20",
"Key": "Finish"
},
{
"Value": "2",
"Key": "Period"
}
],
"VolumeId": "vol-33333333"
}
]
}
and the jq command:
jq -r '.Volumes[] | {"VolumeId": .VolumeId, "Tags": [.Tags[]] | from_entries}'
cat rawjsonfile | jq -r '.Volumes[]|({VolumeId}+(.Tags|from_entries))|{VolumeId,Period,Start,Finish}|to_entries[]|(.key+"="+.value)'
Here rawjsonfile is your "Original JSON" from above.
The result is:
VolumeId=vol-11111111
Period=2
Start=00:00
Finish=00:20
VolumeId=vol-22222222
Period=2
Start=00:00
Finish=00:20
VolumeId=vol-33333333
Period=2
Start=00:00
Finish=00:20
First, unwind the array into individual JSON objects:
cat rawjsonfile | jq -r '.Volumes[]|({VolumeId}+(.Tags|from_entries))'
The result of this first step looks like this:
{
"VolumeId": "vol-11111111",
"Name": "volume1",
"Start": "00:00",
"Finish": "00:20",
"Period": "2"
}
{
"VolumeId": "vol-22222222",
"Name": "volume2",
"Start": "00:00",
"Finish": "00:20",
"Period": "2"
}
{
"VolumeId": "vol-33333333",
"Name": "volume3",
"Start": "00:00",
"Finish": "00:20",
"Period": "2"
}
(jq supports merging JSON objects, which is what {VolumeId}+(.Tags|from_entries) does above.)
Second, select the fields you need:
|{VolumeId,Period,Start,Finish}
Third, turn the object into key=value pairs and join them:
|to_entries[]|(.key+"="+.value)
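If you want those values in actual shell variables (the stated end goal), one option is to have jq emit one tab-separated record per volume and read each record in a while loop. A rough sketch; the snapshot command inside the loop is only an illustration:
# rawjsonfile holds the original JSON from the question
jq -r '.Volumes[]
       | ({VolumeId} + (.Tags | from_entries))
       | [.VolumeId, .Period, .Start, .Finish]
       | @tsv' rawjsonfile |
while IFS=$'\t' read -r VolumeId Period Start Finish; do
  # each iteration sees one volume's values in the four variables
  echo "snapshot $VolumeId (Period=$Period Start=$Start Finish=$Finish)"
  # e.g. aws ec2 create-snapshot --volume-id "$VolumeId" --description "keep for $Period days"
done
Note that because the loop runs at the end of a pipeline it executes in a subshell, so do the per-volume work inside the loop rather than relying on the variables after it ends.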
I am just wondering whether, for an aggregation query in ES, it is possible to use the returned buckets for your own purposes. For example, if I have a response like this:
{
"key": "test",
"doc_count": 2000,
"child": {
"value": 1000
}
}
And I want to get the ratio of doc_count to value, so I am looking for a way to generate another field/aggregation that does the math on those two fields, like this:
{
"key": "test",
"doc_count": 2000,
"child": {
"value": 1000
},
"ratio" : 2
}
or
{
"key": "test",
"doc_count": 1997,
"child": {
"value": 817
},
"buckets": [
{
"key": "ratio",
"value": 2
}
]
}
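In case it clarifies what I am after: the bucket_script pipeline aggregation looks like it might do this kind of per-bucket math. Below is a rough, untested sketch (it assumes the parent is a terms aggregation and child is a single-value metric such as sum or cardinality; the index and field names are made up). Is that the right approach, or is there a better way?
GET /myindex/_search
{
  "size": 0,
  "aggs": {
    "by_key": {
      "terms": { "field": "myKeyField" },
      "aggs": {
        "child": { "cardinality": { "field": "someOtherField" } },
        "ratio": {
          "bucket_script": {
            "buckets_path": { "docs": "_count", "childValue": "child" },
            "script": "params.docs / params.childValue"
          }
        }
      }
    }
  }
}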