How do you run an EmrActivity on an existing EMR cluster? - amazon-data-pipeline

Is there a way to run an EmrActivity in AWS Data Pipeline on an existing cluster? We currently use Data Pipeline to run jobs in AWS EMR via EmrCluster and EmrActivity, but we'd like all pipelines to run on the same cluster. I've tried reading the documentation and building a pipeline in Architect, but I can't seem to find a way to do anything other than create a cluster and run jobs on it; there doesn't seem to be a way to define a new pipeline that uses an existing cluster. If there is, how would I do it? We currently use CloudFormation to create our pipelines, so if possible an example using CloudFormation would be preferable, but I'll take what I can get.

Yes, it is possible:
1. Launch your EMR cluster.
2. Start Task Runner on the master instance with the option --workerGroup=name-of-the-worker-group, as in the sketch below.
3. In the activities of your pipeline, don't specify the runsOn parameter; pass your worker group instead.
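For step 2, the launch command would look roughly like this (a sketch based on the Task Runner documentation linked below; the credentials file, region, and log bucket are placeholders):

# Run on the EMR master node, with the Task Runner jar downloaded there
java -jar TaskRunner-1.0.jar \
  --config ~/credentials.json \
  --workerGroup=name-of-the-worker-group \
  --region=us-east-1 \
  --logUri=s3://mybucket/task-runner-logs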
Here is an example of an activity with such a parameter, defined using CloudFormation:
...
{
  "Id": "S3ToRedshiftCopyActivity",
  "Name": "S3ToRedshiftCopyActivity",
  "Fields": [
    {
      "Key": "type",
      "StringValue": "RedshiftCopyActivity"
    },
    {
      "Key": "workerGroup",
      "StringValue": "name-of-the-worker-group"
    },
    {
      "Key": "insertMode",
      "StringValue": "#{myInsertMode}"
    },
    {
      "Key": "commandOptions",
      "StringValue": "FORMAT CSV"
    },
    {
      "Key": "dependsOn",
      "RefValue": "RedshiftTableCreateActivity"
    },
    {
      "Key": "input",
      "RefValue": "S3StagingDataNode"
    },
    {
      "Key": "output",
      "RefValue": "DestRedshiftTable"
    }
  ]
}
...
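For context, in CloudFormation that object goes inside the PipelineObjects list of an AWS::DataPipeline::Pipeline resource, roughly like this (a sketch; the resource and pipeline names are placeholders):

"Resources": {
  "MyPipeline": {
    "Type": "AWS::DataPipeline::Pipeline",
    "Properties": {
      "Name": "my-pipeline",
      "PipelineObjects": [
        ...the activity above, plus the rest of your pipeline objects...
      ]
    }
  }
}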
You can find detailed documentation on how to do this here:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html

Related

Objects in array is not well supported error observed for ELK docker image

I'm using the latest ELK image for a Kibana dashboard. I have a JSON file containing arrays, and I'm not able to show their contents as fields in Kibana; it shows an "objects in arrays are not well supported" error message.
Going by the Kibana plugin documentation, I went through the link below, but I didn't find anything useful for the ELK Docker image:
https://github.com/istresearch/kibana-object-format
I tried to run the command
Run bin/kibana-plugin install <package.zip>
but it returned that "run" is an unknown command; I removed "Run" and ran the remaining command, but it says the package is invalid.
I'm on a Linux box with Kibana version 7.3.
Is it possible to overcome this issue? How do I deploy that plugin for the ELK image, or is there another way to turn those array objects into fields in Kibana?
I'm not sure how to proceed. Please help me.
Sample Data:
{
  "expand": "schema,names",
  "startAt": 0,
  "maxResults": 50,
  "total": 4,
  "issues": [
    {
      "expand": "operations,versionedRepresentations,editmeta,changelog,renderedFields",
      "id": "1999875",
      "self": "https://amazon.kindle.com/jira/rest/api/2/issue/1999875",
      "key": "KINDLEAMZ-67578",
      "fields": {
        "summary": "contingency is displaying for confirmed card.",
        "priority": {
          "name": "P1",
          "id": "1"
        },
        "created": "2019-09-23T11:25:21.000+0000"
      }
    },
    {
      "expand": "operations,versionedRepresentations,editmeta,changelog,renderedFields",
      "id": "2019428",
      "self": "https://amazon.kindle.com/jira/rest/api/2/issue/2019428",
      "key": "KINDLEAMZ-68661",
      "fields": {
        "summary": "card",
        "priority": {
          "name": "P1",
          "id": "1"
        },
        "created": "2019-09-23T11:25:21.000+0000"
      }
    },
    {
      "expand": "operations,versionedRepresentations,editmeta,changelog,renderedFields",
      "id": "2010958",
      "self": "https://amazon.kindle.com/jira/rest/api/2/issue/2010958",
      "key": "KINDLEAMZ-68167",
      "fields": {
        "summary": "Test Card",
        "priority": {
          "name": "P1",
          "id": "1"
        },
        "created": "2019-09-23T11:25:21.000+0000"
      }
    }
  ]
}
I just want to fetch key, summary, and priority from each element of the issues array, but it's not working as expected: when I try to make a field, it shows up as an array in Kibana. If this doesn't work with 7.3.0, should I downgrade to a lower version? The steps for Docker users are missing from that document. Is there any way to get those details?
Checking the releases page (https://github.com/istresearch/kibana-object-format/releases), it looks like the plugin's latest release was for Elasticsearch 6.3. I guess that is the reason for your error.
I'm not sure there's a fix for this in Kibana itself. There are many long-open issues on this subject, like https://github.com/elastic/kibana/issues/3333.
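If you do decide to move to a 6.3 stack, installing the plugin into the official Kibana image would look roughly like this (an untested sketch; the exact release asset URL is a placeholder, take the real one from the releases page above):

docker run -d --name kibana docker.elastic.co/kibana/kibana:6.3.2
# kibana-plugin can install from a URL or a local zip file
docker exec kibana bin/kibana-plugin install \
  https://github.com/istresearch/kibana-object-format/releases/download/<version>/<package.zip>
docker restart kibana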

Azure IoT event subscription with ARM template

I am trying to deploy an Azure IoT device-connected event subscription to an Azure storage queue using an ARM template and PowerShell. I have used the following template for the deployment. I have also read a lot of articles on Microsoft's documentation but could not find any solution. Please help me figure it out.
"resources": [
{
"type": "Microsoft.EventGrid/eventSubscriptions",
"name": "DeviceConnected",
"location": "[resourceGroup().location]",
"apiVersion": "2018-01-01",
"dependsOn": [
"[resourceId('Microsoft.Devices/IotHubs', variables('iotHubName'))]"
],
"properties": {
"destination": {
"endpointType": "storagequeue",
"properties": {
"queueName":"device-connnection-state-queue",
"resourceId": "[resourceId('Microsoft.Storage/storageAccounts', variables('storageName'))]"
}
},
"filter": {
"includedEventTypes": [
"Microsoft.Devices.DeviceConnected"
]
}
}
}
],
The deployment fails with an error.
The error you're seeing is related to the dependsOn property you've specified.
From the MS documentation:
Resources that must be deployed before this resource is deployed. Resource Manager evaluates the dependencies between resources and deploys them in the correct order. When resources aren't dependent on each other, they're deployed in parallel. The value can be a comma-separated list of resource names or resource unique identifiers. Only list resources that are deployed in this template. Resources that aren't defined in this template must already exist. Avoid adding unnecessary dependencies as they can slow your deployment and create circular dependencies. For guidance on setting dependencies, see Defining dependencies in Azure Resource Manager templates.
So a resource that is not defined in an ARM template cannot be used in a dependsOn property. Since your IoT hub already exists, drop the dependsOn block; note also that the sample below declares the subscription as a nested provider resource scoped to the hub.
Here is the documentation related to event subscription creation:
Microsoft.EventGrid eventSubscriptions template reference
There are not many samples of how to create an event subscription, but you can extract part of the template from the Azure Portal:
1. Click + Event Subscription.
2. Fill in the details.
3. Click the Advanced Editor link in the top right corner.
It will show you some of the details you need to create your ARM template.
Here is what a sample ARM template can look like:
"resources": [
{
"type": "Microsoft.Devices/IotHubs/providers/eventSubscriptions",
"apiVersion": "2019-01-01",
"name": "[concat(parameters('iotHubName'), '/Microsoft.EventGrid/', parameters('eventSubName'))]",
"location": "[resourceGroup().location]",
"properties": {
"topic": "[resourceId('Microsoft.Devices/IotHubs', parameters('iotHubName'))]",
"destination": {
"endpointType": "StorageQueue",
"properties": {
"resourceId": "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName'))]",
"queueName": "[parameters('queueName')]"
}
},
"filter": {
"includedEventTypes": [
"Microsoft.Devices.DeviceConnected"
],
"advancedFilters": []
},
"labels": [],
"eventDeliverySchema": "EventGridSchema"
}
}
]
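To deploy the template, something like the following should work (a sketch; the resource group, file, and parameter values are placeholders, and New-AzResourceGroupDeployment is the PowerShell equivalent):

# Deploy the template into the resource group that holds the IoT hub
az deployment group create \
  --resource-group my-rg \
  --template-file template.json \
  --parameters iotHubName=my-hub eventSubName=DeviceConnected \
    storageAccountName=mystorage queueName=my-queue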

AWS EC2 - starting T2 unlimited instances - bug in EC2 API?

T2 instances can now be started with an additional option that allows more CPU bursting for an additional cost.
SDK: http://docs.aws.amazon.com/aws-sdk-php/v3/api/api-ec2-2016-11-15.html#runinstances
I tried it: I can switch my existing instances to unlimited, so it should be possible.
However, I added the new configuration option to the array and nothing changed; the instance is still set to "standard" as before.
Here is a JSON dump of the RunInstances options array:
{
  "UserData": "....",
  "SecurityGroupIds": [
    "sg-04df967f"
  ],
  "InstanceType": "t2.micro",
  "ImageId": "ami-4e3a4051",
  "MaxCount": 1,
  "MinCount": 1,
  "SubnetId": "subnet-22ec130c",
  "Tags": [
    {
      "Key": "task",
      "Value": "test"
    },
    {
      "Key": "Name",
      "Value": "unlimitedtest"
    }
  ],
  "InstanceInitiatedShutdownBehavior": "terminate",
  "CreditSpecification": {
    "CpuCredits": "unlimited"
  }
}
It starts the EC2 instance successfully just as before, but the CreditSpecification setting is ignored.
Amazon doesn't let normal users contact support, so I hope someone here has a clue about it.
Hmmm... using qualitatively the same RunInstances JSON
{
  "ImageId": "ami-bf4193c7",
  "InstanceType": "t2.micro",
  "CreditSpecification": {
    "CpuCredits": "unlimited"
  }
}
worked for me; the instance shows
T2 Unlimited Enabled
in the Description tab after selecting the instance in the EC2 console.

Mesos : Unreserve slaves resources

Is there any way to reset all of the slaves' reserved resources in Mesos without calling the /unreserve HTTP endpoint for each one?
From the Mesos documentation:
/unreserve (since 0.25.0)
Suppose we want to unreserve the resources that we dynamically reserved above. We can send an HTTP POST request to the master’s /unreserve endpoint like so:
$ curl -i \
  -u <operator_principal>:<password> \
  -d slaveId=<slave_id> \
  -d resources='[
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 8 },
      "role": "ads",
      "reservation": {
        "principal": <reserver_principal>
      }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 4096 },
      "role": "ads",
      "reservation": {
        "principal": <reserver_principal>
      }
    }
  ]' \
  -X POST http://<ip>:<port>/master/unreserve
Mesos doesn't directly provide any support for unreserving resources on more than one slave in a single operation. However, you can write a script that uses the /unreserve endpoint to unreserve the resources on all the slaves in the cluster, e.g. by fetching the list of slaves and their reserved resources from the /slaves endpoint on the master (see the reserved_resources_full key).
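Such a script could look roughly like this (an untested sketch assuming jq is installed; the master address and credentials are placeholders, and the exact shape of reserved_resources_full may vary across Mesos versions):

#!/usr/bin/env bash
MASTER="http://<ip>:<port>"
AUTH="<operator_principal>:<password>"

# For every slave, collect the reserved resources across all roles and unreserve them
curl -s -u "$AUTH" "$MASTER/slaves" |
  jq -c '.slaves[] | {id, resources: ([.reserved_resources_full[]?] | add)}' |
  while read -r slave; do
    resources=$(echo "$slave" | jq -c '.resources')
    # Skip slaves that have no dynamic reservations
    [ "$resources" = "null" ] && continue
    curl -i -u "$AUTH" \
      -d "slaveId=$(echo "$slave" | jq -r '.id')" \
      -d "resources=$resources" \
      -X POST "$MASTER/master/unreserve"
  done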

Automating Hive Activity using aws

I would like to automate my Hive script to run every day, and Data Pipeline is one option for that. The problem is that I am exporting data from DynamoDB to S3 and manipulating that data with a Hive script, and I specify the input and output inside the script itself. That's where the problem starts, because a HiveActivity has to have an input and an output, but I have to give them in the script file.
I am trying to find a way to automate this Hive script and am open to ideas.
Cheers,
You can disable staging on HiveActivity to run any arbitrary Hive script by setting
stage = false
Do something like:
{
  "name": "DefaultActivity1",
  "id": "ActivityId_1",
  "type": "HiveActivity",
  "stage": "false",
  "scriptUri": "s3://bucket/query.hql",
  "scriptVariable": [
    "param1=value1",
    "param2=value2"
  ],
  "schedule": {
    "ref": "ScheduleId_1"
  },
  "runsOn": {
    "ref": "EmrClusterId_1"
  }
},
An alternative to HiveActivity is to use an EmrActivity, as in the following example:
{
  "schedule": {
    "ref": "DefaultSchedule"
  },
  "name": "EMR Activity name",
  "step": "command-runner.jar,hive-script,--run-hive-script,--args,-f,s3://bucket/path/query.hql",
  "runsOn": {
    "ref": "EmrClusterId"
  },
  "id": "EmrActivityId",
  "type": "EmrActivity"
}
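The step string is just the comma-separated form of a command-runner invocation; for comparison, the same step could be submitted directly to a running EMR cluster like this (a sketch; the cluster ID is a placeholder):

# Submit the same Hive step to an existing cluster
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=RunHiveScript,Jar=command-runner.jar,Args=[hive-script,--run-hive-script,--args,-f,s3://bucket/path/query.hql]'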
