I would like to run my Hive script automatically every day, and one option for doing that is Data Pipeline. The problem is that I am exporting data from DynamoDB to S3 and manipulating that data with a Hive script. I specify the input and output inside the Hive script itself, and that's where the problem starts: a HiveActivity has to have an input and an output, but I need to define them in the script file.
I am trying to find a way to automate this Hive script and would appreciate any ideas.
Cheers,
You can disable staging on the HiveActivity to run an arbitrary Hive script:
stage = false
Do something like the following. The values listed under scriptVariable are passed to Hive and can be referenced inside query.hql with Hive variable substitution, e.g. ${param1}:
{
    "name": "DefaultActivity1",
    "id": "ActivityId_1",
    "type": "HiveActivity",
    "stage": "false",
    "scriptUri": "s3://bucket/query.hql",
    "scriptVariable": [
        "param1=value1",
        "param2=value2"
    ],
    "schedule": {
        "ref": "ScheduleId_1"
    },
    "runsOn": {
        "ref": "EmrClusterId_1"
    }
},
An alternative to the HiveActivity is to use an EmrActivity, as in the following example:
{
    "schedule": {
        "ref": "DefaultSchedule"
    },
    "name": "EMR Activity name",
    "step": "command-runner.jar,hive-script,--run-hive-script,--args,-f,s3://bucket/path/query.hql",
    "runsOn": {
        "ref": "EmrClusterId"
    },
    "id": "EmrActivityId",
    "type": "EmrActivity"
}
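If you also need to pass variables when going the EmrActivity route, Hive's -d (define) flag can, as far as I know, be appended after the --args list, and the variables are then available in the script via ${param1} just like with scriptVariable. A sketch, reusing the placeholder paths and names from above:

"step": "command-runner.jar,hive-script,--run-hive-script,--args,-f,s3://bucket/path/query.hql,-d,param1=value1"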
I have created a series of Power Automate flows that connect with Microsoft Forms; many of them require multiple attachments. When the form is submitted, the files are successfully placed into the correct folders in the form's SharePoint directory (or directories), and the files work properly there. On the Power Automate side, however, only the first attachment goes through successfully; the others throw error messages like this:
Adobe Acrobat Failure alert: Acrobat could not open file because it is either not a supported file type or because the file has been damaged
As the form is organizational, I'm grabbing the user profile to populate some of the areas.
The schema for the Parse JSON action is:
{
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {
                "type": "string"
            },
            "link": {
                "type": "string"
            },
            "id": {
                "type": "string"
            },
            "type": {},
            "size": {
                "type": "integer"
            },
            "referenceId": {
                "type": "string"
            },
            "driveId": {
                "type": "string"
            },
            "status": {
                "type": "integer"
            },
            "uploadSessionUrl": {}
        },
        "required": []
    }
}
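For reference, the array produced by the parse looks roughly like this (the values below are made-up placeholders; only the field names come from the schema above):

[
    {
        "name": "example.pdf",
        "link": "https://contoso.sharepoint.com/sites/Example/example.pdf",
        "id": "01EXAMPLEID",
        "type": null,
        "size": 12345,
        "referenceId": "EXAMPLEREF",
        "driveId": "b!exampledriveid",
        "status": 1,
        "uploadSessionUrl": null
    }
]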
Here's the full automation with sensitive information redacted:
Power Automate Flow 1
Power Automate Flow 2
It would be much appreciated if someone could clarify why this is happening and how to correct it. We have people who work with the emails to process the paperwork and who do not (and should not) have access to SharePoint, so other staff currently supply the files by downloading them from the site.
For testing purposes I'm trying to execute this simple pipeline (nothing sophisticated).
However, I'm getting this error:
{"code":"BadRequest","message":null,"target":"pipeline//runid/cb841f14-6fdd-43aa-a9c1-4619dab28cdd","details":null,"error":null}
The goal is to see if two variables are getting the right values (we have been facing some issues in our production environment).
This is the JSON definition of the pipeline:
{
    "name": "GeneralTest",
    "properties": {
        "activities": [
            {
                "name": "Set variable1",
                "type": "SetVariable",
                "dependsOn": [],
                "userProperties": [],
                "typeProperties": {
                    "variableName": "start_time",
                    "value": {
                        "value": "#utcnow()",
                        "type": "Expression"
                    }
                }
            },
            {
                "name": "Wait1",
                "type": "Wait",
                "dependsOn": [
                    {
                        "activity": "Set variable1",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "waitTimeInSeconds": 5
                }
            },
            {
                "name": "Set variable2",
                "description": "",
                "type": "SetVariable",
                "dependsOn": [
                    {
                        "activity": "Wait1",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "variableName": "end_time ",
                    "value": {
                        "value": "#utcnow()",
                        "type": "Expression"
                    }
                }
            }
        ],
        "variables": {
            "start_time": {
                "type": "String"
            },
            "end_time ": {
                "type": "String"
            }
        },
        "folder": {
            "name": "Old Pipelines"
        },
        "annotations": []
    }
}
What am I missing, or what could be the problem with this process?
You have a blank space after the variable name end_time, i.e. "end_time ".
You can see the difference in my repro:
MyCode VS YourCode
Clearing that space will make the execution run just fine.
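With the space removed, the variables block of your definition would simply be:

"variables": {
    "start_time": {
        "type": "String"
    },
    "end_time": {
        "type": "String"
    }
}

(and the variableName in the Set variable2 activity needs the same fix).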
I faced a similar issue when doing a Debug run of one of my pipelines. The error messages for these types of errors are not helpful when running in Debug mode.
What I have found is that if you publish the pipeline and then Trigger a Pipeline Run (instead of a Debug run), you can then go to Monitor Pipeline Runs and it will show you a more useful error message.
Apart from possible blank spaces in variable or parameter names, Data Factory doesn't like hyphens, but only in parameter names; variable names are fine.
Validation passes, but then at debug time you get the same cryptic error.
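For example (the parameter names here are made up), a pipeline parameter named like the first one below triggers that cryptic error at run time, while the second one is fine:

"parameters": {
    "source-table": {
        "type": "String"
    },
    "sourceTable": {
        "type": "String"
    }
}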
I ran into this same error message in Data Factory today on the Copy activity. Everything passed validation, but this error would pop up on each debug run.
I have parameters configured on my dataset connections so that I can use dynamic queries against the data sources. In this case I was using explicit queries, so the parameters appeared irrelevant. I tried both blank values and null values; both failed the same way.
I then tried stupid but real text values and it worked! The pipeline isn't using those values for any work, so their content doesn't matter, but some part of the engine needs a non-null value in the parameters in order to execute.
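For reference, the placeholder values end up on the activity's dataset reference, roughly like this (the dataset and parameter names below are made up):

"inputs": [
    {
        "referenceName": "SourceDataset",
        "type": "DatasetReference",
        "parameters": {
            "tableName": "placeholder"
        }
    }
]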
I'm attempting to migrate CSV data from S3 to DynamoDB using Data Pipeline. The data is not in a DynamoDB export format but instead in a normal CSV.
I understand that Data Pipeline is more typically used for importing or exporting data in the DynamoDB export format rather than a standard CSV. I think I've read across my Googling that it is possible to use a normal file, but I haven't been able to put together something that works. The AWS documentation hasn't been terribly helpful either, and I haven't been able to find reference posts that are relatively recent (less than 2 years old).
If this is possible, can anyone provide some insight into why my pipeline may not be working? I've pasted the pipeline and error message below. The error seems to indicate an issue inserting data into DynamoDB, I'm guessing because it's not in the export format.
I'd do it in Lambda but the data load takes longer than 15 minutes.
Thanks
{
    "objects": [
        {
            "myComment": "Activity used to run the hive script to import CSV data",
            "output": {
                "ref": "dynamoDataTable"
            },
            "input": {
                "ref": "s3csv"
            },
            "name": "S3toDynamoLoader",
            "hiveScript": "DROP TABLE IF EXISTS tempHiveTable;\n\nDROP TABLE IF EXISTS s3TempTable;\n\nCREATE EXTERNAL TABLE tempHiveTable (#{myDDBColDef}) \nSTORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' \nTBLPROPERTIES (\"dynamodb.table.name\" = \"#{myDDBTableName}\", \"dynamodb.column.mapping\" = \"#{myDDBTableColMapping}\");\n \nCREATE EXTERNAL TABLE s3TempTable (#{myS3ColDef}) \nROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\\n' LOCATION '#{myInputS3Loc}';\n \nINSERT OVERWRITE TABLE tempHiveTable SELECT * FROM s3TempTable;",
            "id": "S3toDynamoLoader",
            "runsOn": { "ref": "EmrCluster" },
            "stage": "false",
            "type": "HiveActivity"
        },
        {
            "myComment": "The DynamoDB table that we are uploading to",
            "name": "DynamoDB",
            "id": "dynamoDataTable",
            "type": "DynamoDBDataNode",
            "tableName": "#{myDDBTableName}",
            "writeThroughputPercent": "1.0",
            "dataFormat": {
                "ref": "DDBTableFormat"
            }
        },
        {
            "failureAndRerunMode": "CASCADE",
            "resourceRole": "DataPipelineDefaultResourceRole",
            "role": "DataPipelineDefaultRole",
            "pipelineLogUri": "#{myLogUri}",
            "scheduleType": "ONDEMAND",
            "name": "Default",
            "id": "Default"
        },
        {
            "name": "EmrCluster",
            "coreInstanceType": "m1.medium",
            "coreInstanceCount": "1",
            "masterInstanceType": "m1.medium",
            "releaseLabel": "emr-5.29.0",
            "id": "EmrCluster",
            "type": "EmrCluster",
            "terminateAfter": "2 Hours"
        },
        {
            "myComment": "The S3 file that contains the data we're importing",
            "directoryPath": "#{myInputS3Loc}",
            "dataFormat": {
                "ref": "csvFormat"
            },
            "name": "S3DataNode",
            "id": "s3csv",
            "type": "S3DataNode"
        },
        {
            "myComment": "Format for the S3 Path",
            "name": "S3ExportFormat",
            "column": "not_used STRING",
            "id": "csvFormat",
            "type": "CSV"
        },
        {
            "myComment": "Format for the DynamoDB table",
            "name": "DDBTableFormat",
            "id": "DDBTableFormat",
            "column": "not_used STRING",
            "type": "DynamoDBExportDataFormat"
        }
    ],
"parameters": [
{
"description": "S3 Column Mappings",
"id": "myS3ColDef",
"default": "phoneNumber string,firstName string,lastName string, spend double",
"type": "String"
},
{
"description": "DynamoDB Column Mappings",
"id": "myDDBColDef",
"default": "phoneNumber String,firstName String,lastName String, spend double",
"type": "String"
},
{
"description": "Input S3 foder",
"id": "myInputS3Loc",
"default": "s3://POCproject-dev1-data/upload/",
"type": "AWS::S3::ObjectKey"
},
{
"description": "DynamoDB table name",
"id": "myDDBTableName",
"default": "POCproject-pipeline-data",
"type": "String"
},
{
"description": "S3 to DynamoDB Column Mapping",
"id": "myDDBTableColMapping",
"default": "phoneNumber:phoneNumber,firstName:firstName,lastName:lastName,spend:spend",
"type": "String"
},
{
"description": "DataPipeline Log Uri",
"id": "myLogUri",
"default": "s3://POCproject-dev1-data/",
"type": "AWS::S3::ObjectKey"
}
]
}
Error
[INFO] (TaskRunnerService-df-09432511OLZUA8VN0NLE_#EmrCluster_2020-03-06T02:52:47-0) df-09432511OLZUA8VN0NLE amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :Caused by: java.lang.RuntimeException: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:108)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.writeBatch(DynamoDBClient.java:258)
at org.apache.hadoop.dynamodb.DynamoDBClient.putBatch(DynamoDBClient.java:215)
at org.apache.hadoop.dynamodb.write.AbstractDynamoDBRecordWriter.write(AbstractDynamoDBRecordWriter.java:112)
at org.apache.hadoop.hive.dynamodb.write.HiveDynamoDBRecordWriter.write(HiveDynamoDBRecordWriter.java:42)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
... 18 more
Caused by: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
Have you tried this sample yet? It uses Hive to import a CSV file into a DynamoDB table:
https://github.com/aws-samples/data-pipeline-samples/tree/master/samples/DynamoDBImportCSV
T2 instances can now be started with an additional option (T2 Unlimited) that allows more CPU bursting at additional cost.
SDK: http://docs.aws.amazon.com/aws-sdk-php/v3/api/api-ec2-2016-11-15.html#runinstances
I tried it, and I can switch my instances to unlimited, so it should be possible.
However, when I add the new configuration option to the array, nothing changes; the credit specification is still set to "standard" as before.
Here is a JSON dump of the runInstances option array:
{
    "UserData": "....",
    "SecurityGroupIds": [
        "sg-04df967f"
    ],
    "InstanceType": "t2.micro",
    "ImageId": "ami-4e3a4051",
    "MaxCount": 1,
    "MinCount": 1,
    "SubnetId": "subnet-22ec130c",
    "Tags": [
        {
            "Key": "task",
            "Value": "test"
        },
        {
            "Key": "Name",
            "Value": "unlimitedtest"
        }
    ],
    "InstanceInitiatedShutdownBehavior": "terminate",
    "CreditSpecification": {
        "CpuCredits": "unlimited"
    }
}
The EC2 instance starts successfully just as before; however, the CreditSpecification setting is ignored.
Amazon doesn't let normal users contact support, so I hope someone here has a clue about it.
Hmmm... Using qualitatively the same run-instances JSON
{
    "ImageId": "ami-bf4193c7",
    "InstanceType": "t2.micro",
    "CreditSpecification": {
        "CpuCredits": "unlimited"
    }
}
worked for me; the instance shows this:
T2 Unlimited Enabled
in the "Description" tab after selecting the instance in the EC2 console.
Is there a way to run an EmrActivity in AWS Data Pipeline on an existing cluster? We are currently using Data Pipeline to run jobs in AWS EMR using EmrCluster and EmrActivity, but we'd like to have all pipelines run on the same cluster. I've tried reading the documentation and building a pipeline in Architect, but I can't seem to find a way to do anything other than create a cluster and run jobs on it. There doesn't seem to be a way to define a new pipeline that uses an existing cluster. If there is, how would I do it? We're currently using CloudFormation to create our pipelines, so if possible an example using CloudFormation would be preferable, but I'll take what I can get.
Yes, it is possible.
1. Launch your EMR cluster.
2. Start Task Runner on the master instance with the option --workerGroup=name-of-the-worker-group.
3. In the activities of your pipeline, don't specify the runsOn parameter; pass your worker group instead.
Here is an example of an activity with such a parameter, defined using CloudFormation:
...
{
    "Id": "S3ToRedshiftCopyActivity",
    "Name": "S3ToRedshiftCopyActivity",
    "Fields": [
        {
            "Key": "type",
            "StringValue": "RedshiftCopyActivity"
        },
        {
            "Key": "workerGroup",
            "StringValue": "name-of-the-worker-group"
        },
        {
            "Key": "insertMode",
            "StringValue": "#{myInsertMode}"
        },
        {
            "Key": "commandOptions",
            "StringValue": "FORMAT CSV"
        },
        {
            "Key": "dependsOn",
            "RefValue": "RedshiftTableCreateActivity"
        },
        {
            "Key": "input",
            "RefValue": "S3StagingDataNode"
        },
        {
            "Key": "output",
            "RefValue": "DestRedshiftTable"
        }
    ]
}
...
You can find detailed documentation on how to do that here:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html