I am seeing a strange problem where the export-image task is stuck at 85%, at the "converting" step, and it does not finish even after waiting 6 hours.
The steps I used are pretty standard:
% aws ec2 export-image --image-id ami-0123c45d6789d012d --disk-image-format VMDK --s3-export-location S3Bucket=ami-snapshots-bucket --region us-west-2
And here is the status, stuck at 85%:
% aws ec2 describe-export-image-tasks --export-image-task-ids export-ami-0ddfc0123456789d1 --region us-west-2
{
"ExportImageTasks": [
{
"ExportImageTaskId": "export-ami-0ddfc0123456789d1",
"Progress": "85",
"S3ExportLocation": {
"S3Bucket": "ami-snapshots-bucket"
},
"Status": "active",
"StatusMessage": "converting",
"Tags": []
}
]
}
Has anyone seen a similar issue, or does anyone know how to make this work?
Thanks.
For 80 GB of storage, mine finished in about 2 hours, though it was stuck at 85% for a while.
Another user reported that a 40 GB image/snapshot took about 3 hours to complete.
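One option is to poll the task status periodically instead of waiting blindly; here is a minimal sketch with boto3 (not from the original post; it reuses the task ID and region from the question, and the 60-second interval is arbitrary):
import time
import boto3
# Poll the export task until it leaves the "active" state.
ec2 = boto3.client("ec2", region_name="us-west-2")
task_id = "export-ami-0ddfc0123456789d1"  # task ID from the question above
while True:
    task = ec2.describe_export_image_tasks(ExportImageTaskIds=[task_id])["ExportImageTasks"][0]
    print(task.get("Progress"), task.get("Status"), task.get("StatusMessage"))
    if task.get("Status") != "active":
        break
    time.sleep(60)  # check once a minute instead of watching the console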
Related
How do we assign jobs to a specific queue when we have multiple queues? I'm using Hadoop YARN with AWS EMR.
On AWS EMR you can create a cluster with Spark installed and set spark.scheduler.mode using the following command, which references a file, myConfig.json, stored in Amazon S3.
aws emr create-cluster --release-label emr-5.36.0 --applications Name=Spark \
--instance-type m5.xlarge --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json
myConfig.json:
[
{
"Classification": "spark-defaults",
"Properties": {
"spark.scheduler.mode": "FAIR"
}
}
]
Or you can specify which scheduler mode to use when initializing the SparkSession, with the following parameters:
val sparkConf = new SparkConf()
sparkConf.set("spark.scheduler.mode", "FAIR")
...
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
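If you are working in PySpark rather than Scala, the equivalent configuration would look roughly like this (a minimal sketch, not part of the original answer):
from pyspark.sql import SparkSession
# Set the scheduler mode when building the session, mirroring the Scala snippet above.
spark = (SparkSession.builder
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate())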
I tried adding a service principal to an Azure Databricks workspace using Cloud Shell, but I am getting an error. I am able to see all the clusters in the workspace, and I was the one who created the workspace. Do I need to be in the admin group if I want to add a service principal to the workspace?
curl --netrc -X POST \
  https://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.net/api/2.0/preview/scim/v2/ServicePrincipals \
  --header 'Content-type: application/scim+json' \
  --data @create-service-principal.json \
  | jq .
The file has the following contents:
{ "displayName": "sp-name", "applicationId": "a9217fxxxxcd-9ab8-dxxxxxxxxxxxxx", "entitlements": [ { "value": "allow-cluster-create" } ], "schemas": [ "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal" ], "active": true }
Here is the error I am getting:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   279  100   279    0     0   5166      0 --:--:-- --:--:-- --:--:--  5264
parse error: Invalid numeric literal at line 2, column 0
Do I need to be in the admin group if I want to add a service principal to the workspace?
The issue is with the JSON file, not with access to the admin group.
Check the double quotes on line 2 of your JSON file.
You can refer to this GitHub link.
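As a quick sanity check, you can also parse the file locally before retrying the curl call; a bad quote or a trailing comma will surface as a clear parse error. A minimal sketch in Python (assuming the file is named create-service-principal.json, as in the question):
import json
# Attempt to parse the request body; malformed JSON raises json.JSONDecodeError
# with the offending line and column.
with open("create-service-principal.json") as f:
    payload = json.load(f)
print(json.dumps(payload, indent=2))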
Try this Python code, which you can run in a Databricks notebook:
import pandas
import json
import requests
# COMMAND ----------
# MAGIC %md ### define variables
# COMMAND ----------
pat = 'EnterPATHere' # paste PAT. Get it from settings > user settings
workspaceURL = 'EnterWorkspaceURLHere' # paste the workspace url in the format of 'https://adb-1234567.89.azuredatabricks.net'
applicationID = 'EnterApplicationIDHere' # paste ApplicationID / ClientID of Service Principal. Get it from Azure Portal
friendlyName = 'AddFriendlyNameHere' # paste FriendlyName of ServicePrincipal. Get it from Azure Portal
# COMMAND ----------
# MAGIC %md ### add service principal
# COMMAND ----------
payload_raw = {
'schemas':
['urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal'],
'applicationId': applicationID,
'displayName': friendlyName,
'groups':[],
'entitlements':[]
}
payload = json.loads(json.dumps(payload_raw))
response = requests.post(workspaceURL + '/api/2.0/preview/scim/v2/ServicePrincipals',\
headers = {'Authorization' : 'Bearer '+ pat,\
'Content-Type': 'application/scim+json'},\
data=json.dumps(payload))
response.content
I have actually published a blog post that provides a Python script to fully manage service principals and access control in Databricks workspaces.
My application on Heroku servers is located in the EU region. But that's all the info I get.
The EU region has two locations: Ireland and Germany. I need to know in which of these locations my app is hosted.
I can display regions using the heroku regions command, but that doesn't help, as it just lists all available regions.
The output of the heroku info <APPNAME> and heroku regions --json commands can be used in combination with the AWS IP Ranges JSON to figure out which exact region you're looking at.
When you run heroku info <APPNAME>, you will see the general region, but it is broad and not very specific. For example, in this case it's us:
=== APPNAME
Auto Cert Mgmt: false
Dynos: web: 1
Git URL: https://git.heroku.com/APPNAME.git
Owner: example@email.com
Region: us
Repo Size: 0 B
Slug Size: 37 MB
Stack: heroku-20
Web URL: https://APPNAME.herokuapp.com/
If you look at the output of heroku regions --json, you'll see an entry whose "name" matches the "Region" from heroku info. Its provider region here is us-east-1:
...
{
"country": "United States",
"created_at": "2012-11-21T20:44:16Z",
"description": "United States",
"id": "59accabd-516d-4f0e-83e6-6e3757701145",
"locale": "Virginia",
"name": "us",
"private_capable": false,
"provider": {
"name": "amazon-web-services",
"region": "us-east-1"
},
"updated_at": "2016-08-09T22:03:28Z"
},
...
From there, the "provider" key in that region's JSON shows you the provider's name (likely AWS) and its region, which should be more fine-grained than the general Heroku region.
Now you can take that region name and look it up in the AWS IP Ranges JSON file; that gives you the IP prefixes associated with it. (Note that many prefixes will match the region, because many IP ranges are associated with it.)
Here is one example of an entry which matches the us-east-1 region.
...
{
"ip_prefix": "52.94.152.9/32",
"region": "us-east-1",
"service": "AMAZON",
"network_border_group": "us-east-1"
},
...
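If you want to script that lookup, the published ip-ranges.json can be filtered by region. A minimal sketch in Python (not from the original answer; it assumes the us-east-1 region from the example above):
import requests
# Download the published AWS IP ranges and keep the IPv4 prefixes for one region.
ranges = requests.get("https://ip-ranges.amazonaws.com/ip-ranges.json").json()
region = "us-east-1"
prefixes = [p["ip_prefix"] for p in ranges["prefixes"] if p["region"] == region]
print(f"{len(prefixes)} IPv4 prefixes for {region}")
print(prefixes[:5])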
You can use heroku info <APP-NAME> to find the Region ID. It will display some data like below.
=== APP-NAME
Addons: heroku-postgresql:hobby-dev
Auto Cert Mgmt: false
Collaborators: youremail@email.com
Dynos: web: 1
Git URL: https://git.heroku.com/appname.git
Owner: appname@email.com
Region: us
Repo Size: 0 B
Slug Size: 168 MB
Stack: heroku-18
Web URL: https://appname.herokuapp.com/
As you can see, the Region ID is us. Then you can use the heroku regions command to find the locations that belong to that ID.
ID Location Runtime
───────── ─────────────────────── ──────────────
eu Europe Common Runtime
us United States Common Runtime
dublin Dublin, Ireland Private Spaces
frankfurt Frankfurt, Germany Private Spaces
oregon Oregon, United States Private Spaces
sydney Sydney, Australia Private Spaces
tokyo Tokyo, Japan Private Spaces
virginia Virginia, United States Private Spaces
I have a JSON document that looks like this:
{
"failedSet": [],
"successfulSet": [{
"event": {
"arn": "arn:aws:health:us-east-1::event/AWS_RDS_MAINTENANCE_SCHEDULED_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx",
"endTime": 1502841540.0,
"eventTypeCategory": "scheduledChange",
"eventTypeCode": "AWS_RDS_MAINTENANCE_SCHEDULED",
"lastUpdatedTime": 1501208541.93,
"region": "us-east-1",
"service": "RDS",
"startTime": 1502236800.0,
"statusCode": "open"
},
"eventDescription": {
"latestDescription": "We are contacting you to inform you that one or more of your Amazon RDS DB instances is scheduled to receive system upgrades during your maintenance window between August 8 5:00 PM and August 15 4:59 PM PDT. Please see the affected resource tab for a list of these resources. \r\n\r\nWhile the system upgrades are in progress, Single-AZ deployments will be unavailable for a few minutes during your maintenance window. Multi-AZ deployments will be unavailable for the amount of time it takes a failover to complete, usually about 60 seconds, also in your maintenance window. \r\n\r\nPlease ensure the maintenance windows for your affected instances are set appropriately to minimize the impact of these system upgrades. \r\n\r\nIf you have any questions or concerns, contact the AWS Support Team. The team is available on the community forums and by contacting AWS Premium Support. \r\n\r\nhttp://aws.amazon.com/support\r\n"
}
}]
}
I'm trying to add a new key/value pair under successfulSet[].event (with the key named affectedEntities) using jq. I've seen some examples, like here and here, but none of those answers really show how to add one key with possibly multiple values (I say possibly because, as of now, AWS is returning one value for the affected entity, but if there are more, I'd like to list them all).
EDIT: The value of the new key that I want to add is stored in a variable called $affected_entities and a sample of that value looks like this:
[
"arn:aws:acm:us-east-1:xxxxxxxxxxxxxx:certificate/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
]
The value could look like this:
[
"arn:aws:acm:us-east-1:xxxxxxxxxxxxxx:certificate/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx",
"arn:aws:acm:us-east-1:xxxxxxxxxxxxxx:certificate/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx",
...
...
...
]
You can use this jq command:
jq '.successfulSet[].event += { "new_key" : "new_value" }' file.json
EDIT:
Try this:
jq --argjson argval "$new_value" '.successfulSet[].event += { "affected_entities" : $argval }' file.json
Test:
sat~$ new_value='[
"arn:aws:acm:us-east-1:xxxxxxxxxxxxxx:certificate/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
]'
sat~$ jq --argjson argval "$new_value" '.successfulSet[].event += { "affected_entities" : $argval }' file.json
Note that --argjson works with jq 1.5 and above.
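With the sample input and variable from the question, each event object should then carry the new key alongside the existing fields, roughly like this (abbreviated):
"event": {
"arn": "arn:aws:health:us-east-1::event/AWS_RDS_MAINTENANCE_SCHEDULED_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx",
...
"statusCode": "open",
"affected_entities": [
"arn:aws:acm:us-east-1:xxxxxxxxxxxxxx:certificate/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
]
}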
In Spark 1.6.1 (code is in Scala 2.10), I am trying to write a DataFrame to a Parquet file:
import sc.implicits._
val triples = file.map(p => _parse(p, " ", true)).toDF()
triples.write.mode(SaveMode.Overwrite).parquet("hdfs://some.external.ip.address:9000/tmp/table.parquet")
When I run it in development mode, everything works fine. It also works fine if I set up a master and one worker in standalone mode in a Docker environment (separate Docker containers) on the same machine. It fails when I try to execute it on a cluster (1 master, 5 workers). If I set it up locally on the master, it also works.
When I try to execute it, I get the following stack trace:
{
"duration": "18.716 secs",
"classPath": "LDFSparkLoaderJobTest2",
"startTime": "2016-07-18T11:41:03.299Z",
"context": "sql-context",
"result": {
"errorClass": "org.apache.spark.SparkException",
"cause": "Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, curry-n3): java.lang.NullPointerException
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:147)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.abortTask$1(WriterContainer.scala:294)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:271)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)\n\nDriver stacktrace:",
"stack":[
"org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)",
"scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)",
"scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)",
"org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)",
"scala.Option.foreach(Option.scala:236)",
"org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)",
"org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)",
"org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)",
"org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)",
"org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)",
"org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)",
"org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)",
"org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)",
"org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)",
"org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)",
"org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)",
"org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)",
"org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)",
"LDFSparkLoaderJobTest2$.readFile(SparkLoaderJob.scala:55)",
"LDFSparkLoaderJobTest2$.runJob(SparkLoaderJob.scala:48)",
"LDFSparkLoaderJobTest2$.runJob(SparkLoaderJob.scala:18)",
"spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:268)",
"scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)",
"scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)",
"java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)",
"java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)",
"java.lang.Thread.run(Thread.java:745)"
],
"causingClass": "org.apache.spark.SparkException",
"message": "Job aborted."
},
"status": "ERROR",
"jobId": "54ad3056-3aaa-415f-8352-ca8c57e02fe9"
}
Notes:
The job is submitted via the Spark Jobserver.
The file that needs to be converted to a Parquet file is 15.1 MB in size.
Question:
Is there something I am doing wrong (I followed the docs)?
Or is there another way I can create the Parquet file so that all my workers have access to it?
In your standalone setup only one worker was writing with the ParquetRecordWriter, so it worked fine.
In the real test, i.e. the cluster (1 master, 5 workers), it fails because multiple workers are writing concurrently with the ParquetRecordWriter.
Please try the following:
import sc.implicits._
val triples = file.map(p => _parse(p, " ", true)).toDF()
triples.write.mode(SaveMode.Append).parquet("hdfs://some.external.ip.address:9000/tmp/table.parquet")
Please see SaveMode.Append ("append"): when saving a DataFrame to a data source, if the data/table already exists, the contents of the DataFrame are expected to be appended to the existing data.
I had not exactly the same, but similar, issues writing DataFrames to Parquet files in cluster mode. Those problems disappeared when deleting the output path just before writing, using this convenience function write(..):
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame
..
def main(arg: Array[String]) {
..
val fs = FileSystem.get(sc.hadoopConfiguration)
..
def write(df: DataFrame, fn: String) = {
// build the target path and remove any previous output (recursively) before writing
val op1 = s"hdfs:///user/you/$fn"
fs.delete(new Path(op1), true)
df.write.parquet(op1)
}
Give it a go, tell me if it works for you...