Create a new cluster in Databricks using databricks-cli

I'm trying to create a new cluster in Databricks on Azure using databricks-cli.
I'm using the following command:
databricks clusters create --json '{ "cluster_name": "template2", "spark_version": "4.1.x-scala2.11" }'
And getting back this error:
Error: {"error_code":"INVALID_PARAMETER_VALUE","message":"Missing required field: size"}
I can't find any documentation on this issue and would appreciate some help.

I found the right answer here.
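The correct format for this command on Azure also specifies a node type and a cluster size (here through autoscale; the "size" in the error appears to refer to the worker count, which the original command did not set):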
databricks clusters create --json '{ "cluster_name": "my-cluster", "spark_version": "4.1.x-scala2.11", "node_type_id": "Standard_DS3_v2", "autoscale" : { "min_workers": 2, "max_workers": 50 } }'

Just to add to the answer that @MorShemesh gave, you can also pass a path to a JSON file instead of specifying the JSON on the command line.
databricks clusters create --json-file /path/to/my/cluster_config.json
If you are managing lots of clusters this might be an easier approach.
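As a minimal sketch of the file-based approach, the configuration could be generated from a script and then passed to --json-file. The helper name write_cluster_config and the output file name cluster_config.json are assumptions made for illustration; the field names and values are the ones from the answer above.

```python
import json
from pathlib import Path


def write_cluster_config(path, cluster_name, spark_version, node_type_id,
                         min_workers, max_workers):
    """Write a cluster definition to a JSON file (hypothetical helper)."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    config = {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # The cluster size is given here as an autoscale range.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }
    Path(path).write_text(json.dumps(config, indent=2) + "\n")
    return config


# Values taken from the answer above; the file name is an assumption.
config = write_cluster_config(
    "cluster_config.json", "my-cluster", "4.1.x-scala2.11",
    "Standard_DS3_v2", 2, 50,
)
```

The resulting file can then be used as: databricks clusters create --json-file cluster_config.json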

databricks clusters create --json '{ "cluster_name": "custom-cluster", "spark_version": "4.1.x-scala2.11", "node_type_id": "Standard_DS3_v2", "autoscale" : { "min_workers": 2, "max_workers": 50 }}'
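A note on quoting inline JSON in bash: a double-quoted string ends at the next double quote, so JSON wrapped in double quotes loses its inner quotes before the CLI sees it, while single quotes pass it through unchanged. The sketch below illustrates this with Python's shlex module, which approximates POSIX shell word splitting; the helper json_argument and the shortened command lines are made up for the illustration.

```python
import json
import shlex

# Two ways of quoting the same (shortened) inline JSON argument.
double_quoted = 'databricks clusters create --json "{ "cluster_name": "my-cluster" }"'
single_quoted = "databricks clusters create --json '{ \"cluster_name\": \"my-cluster\" }'"


def json_argument(command_line):
    """Return the word that follows --json after shell-style splitting."""
    tokens = shlex.split(command_line)
    return tokens[tokens.index("--json") + 1]


def is_valid_json(text):
    """Report whether text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


for label, line in (("double", double_quoted), ("single", single_quoted)):
    argument = json_argument(line)
    print(label, repr(argument), is_valid_json(argument))
```

Only the single-quoted form reaches the CLI as valid JSON in this sketch.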

Related

kafka.common.KafkaException: Failed to parse the broker info from zookeeper from EC2 to elastic search

I have AWS MSK set up and I am trying to sink records from MSK to Elasticsearch.
I am able to push data into MSK in JSON format, and I believe the rest of the setup is correct.
This is what I have done on the EC2 instance:
wget http://packages.confluent.io/archive/3.1/confluent-oss-3.1.2-2.11.tar.gz -P ~/Downloads/
tar -zxvf ~/Downloads/confluent-oss-3.1.2-2.11.tar.gz -C ~/Downloads/
sudo mv ~/Downloads/confluent-3.1.2 /usr/local/confluent
/usr/local/confluent/etc/kafka-connect-elasticsearch
After that I modified the kafka-connect-elasticsearch configuration and set my Elasticsearch URL:
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=AWSKafkaTutorialTopic
key.ignore=true
connection.url=https://search-abcdefg-risdfgdfgk-es-ex675zav7k6mmmqodfgdxxipg5cfsi.us-east-1.es.amazonaws.com
type.name=kafka-connect
The producer sends messages in the following format:
{
  "data": {
    "RequestID": 517082653,
    "ContentTypeID": 9,
    "OrgID": 16145,
    "UserID": 4,
    "PromotionStartDateTime": "2019-12-14T16:06:21Z",
    "PromotionEndDateTime": "2019-12-14T16:16:04Z",
    "SystemStartDatetime": "2019-12-14T16:17:45.507000000Z"
  },
  "metadata": {
    "timestamp": "2019-12-29T10:37:31.502042Z",
    "record-type": "data",
    "operation": "insert",
    "partition-key-type": "schema-table",
    "schema-name": "dbo",
    "table-name": "TRFSDIQueue"
  }
}
I am a little confused about how Kafka Connect gets started here. Do I need to start it myself, and if so, how?
I also started Schema Registry as shown below, which gave me an error.
/usr/local/confluent/bin/schema-registry-start /usr/local/confluent/etc/schema-registry/schema-registry.properties
When I do that I get the error below:
[2019-12-29 13:49:17,861] ERROR Server died unexpectedly: (io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain:51)
kafka.common.KafkaException: Failed to parse the broker info from zookeeper: {"listener_security_protocol_map":{"CLIENT":"PLAINTEXT","CLIENT_SECURE":"SSL","REPLICATION":"PLAINTEXT","REPLICATION_SECURE":"SSL"},"endpoints":["CLIENT:/
Please help.
As suggested in the answer, I upgraded the Kafka Connect version, but then I started getting the error below:
ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication:63)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:210)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:61)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:72)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:39)
at io.confluent.rest.Application.createServer(Application.java:201)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:41)
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: Timed out trying to create or validate schema topic configuration
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:168)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:111)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:208)
... 5 more
Caused by: java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:274)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:161)
... 7 more

First, Confluent Platform 3.1.2 is fairly old. I suggest you get the version that aligns with your Kafka version.
You start Kafka Connect using the appropriate connect-* scripts and properties files, located under the bin and etc/kafka folders.
For example:
/usr/local/confluent/bin/connect-standalone \
/usr/local/confluent/etc/kafka/connect-standalone.properties \
/usr/local/confluent/etc/kafka-connect-elasticsearch/quickstart-elasticsearch.properties
If that works, you can move on to using the connect-distributed command instead.
Regarding Schema Registry, you can search its GitHub issues and find multiple people trying to get MSK to work. The root issue is that MSK does not expose a listener named PLAINTEXT and that Schema Registry does not support named listeners. (This may have changed since the 5.x versions.)
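To make the listener point concrete, here is a small sketch that inspects a broker registration like the one in the error message and lists the listener names it advertises. The host names and ports in the sample are hypothetical placeholders (the original error output is truncated), and the helper names are made up for this illustration; only the listener_security_protocol_map values are taken from the error above.

```python
import json


def listener_names(broker_info):
    """Return the listener names from a broker registration JSON string."""
    info = json.loads(broker_info)
    # Each endpoint has the form NAME://host:port.
    return [endpoint.split("://", 1)[0] for endpoint in info.get("endpoints", [])]


def has_plaintext_named_listener(broker_info):
    """True if one of the advertised listeners is literally named PLAINTEXT."""
    return "PLAINTEXT" in listener_names(broker_info)


# Sample registration: the protocol map matches the error message above,
# the endpoints use hypothetical hosts and ports.
sample = json.dumps({
    "listener_security_protocol_map": {
        "CLIENT": "PLAINTEXT",
        "CLIENT_SECURE": "SSL",
        "REPLICATION": "PLAINTEXT",
        "REPLICATION_SECURE": "SSL",
    },
    "endpoints": [
        "CLIENT://broker-1.example.internal:9092",
        "CLIENT_SECURE://broker-1.example.internal:9094",
    ],
})

print(listener_names(sample))
print(has_plaintext_named_listener(sample))
```

In this sample the listeners are named CLIENT and CLIENT_SECURE, so a client that only understands a listener called PLAINTEXT would have nothing to match, even though CLIENT maps to the PLAINTEXT protocol.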
You could also try running the Connect and Schema Registry containers in ECS / EKS rather than extracting the archive on an EC2 machine.

Sqoop through JAVA API

We are trying to Sqoop data from MySQL to HDFS. When we run the code, the data gets stored in the local file system, but we want it in HDFS. Can anyone suggest what is wrong with the following code?
SqoopOptions options = new SqoopOptions();
options.setConnectString("jdbc:mysql://hostname/db_name");
options.setUsername("user");
options.setPassword("pass");
options.setTableName("table");
options.setDirectMode(true);
options.setNumMappers(4);
options.setDriverClassName("com.mysql.jdbc.Driver");
options.setSqlQuery("select * from table");
options.setWhereClause("value > 15.0");
options.setTargetDir("output");
options.doHiveImport();
System.out.println();
int ret=new ImportTool().run(options);
System.out.println(ret);

I ran the same program in hdfs and got the output :)

Here the issue is with options.setTargetDir("output");
You are not specifying a fully qualified HDFS path. If you replace "output" with a valid HDFS path (for example, one that starts with hdfs://), you should be able to run the code from anywhere and still get a proper result.

pig + hbase + hadoop2 integration

Has anyone successfully loaded data into hbase-0.98.0 from pig-0.12.0 on hadoop-2.2.0 without encountering this error:
ERROR 2998: Unhandled internal error.
org/apache/hadoop/hbase/filter/WritableByteArrayComparable
with a line of log trace:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArra
I searched the web and found a handful of problems and solutions, but all of them refer to pre-hadoop2 and hbase-0.94.x, which do not apply to my situation.
I have a 5-node hadoop-2.2.0 cluster, a 3-node hbase-0.98.0 cluster, and a client machine with hadoop-2.2.0, hbase-0.98.0, and pig-0.12.0 installed. Each of them works fine on its own: HDFS, MapReduce, the region servers, and Pig all function. To complete a "loading data to HBase from Pig" example, I have the following export:
export PIG_CLASSPATH=$HADOOP_INSTALL/etc/hadoop:$HBASE_PREFIX/lib/*.jar:$HBASE_PREFIX/lib/protobuf-java-2.5.0.jar:$HBASE_PREFIX/lib/zookeeper-3.4.5.jar
When I ran pig -x local -f loaddata.pig, I got the following error:
ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable
(This must be the 100th-plus time I have seen it during countless attempts to find a working setup.)
The trace log shows:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArrayComparable
The following is my Pig script:
REGISTER /usr/local/hbase/lib/hbase-*.jar;
REGISTER /usr/local/hbase/lib/hadoop-*.jar;
REGISTER /usr/local/hbase/lib/protobuf-java-2.5.0.jar;
REGISTER /usr/local/hbase/lib/zookeeper-3.4.5.jar;
raw_data = LOAD '/home/hdadmin/200408hourly.txt' USING PigStorage(',');
weather_data = FOREACH raw_data GENERATE $1, $10;
ranked_data = RANK weather_data;
final_data = FILTER ranked_data BY $0 IS NOT NULL;
STORE final_data INTO 'hbase://weather' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:date info:temp');
I have successfully created an HBase table 'weather'.
Has anyone got this working and would be willing to share how?

Rebuild Pig from source against the newer HBase version:
ant clean jar-withouthadoop -Dhadoopversion=23 -Dhbaseversion=95
By default it builds against HBase 0.94; 94 and 95 are the only options.

If you know which jar file contains the missing class, e.g. org/apache/hadoop/hbase/filter/WritableByteArray, then you can use the pig.additional.jars property when running the pig command to ensure that the jar file is available to all the mapper tasks.
pig -D pig.additional.jars=FullPathToJarFile.jar bulkload.pig
Example:
pig -D pig.additional.jars=/usr/lib/hbase/lib/hbase-protocol.jar bulkload.pig
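If you do not know which jar contains the class, a short script can scan a lib directory for it. The following is a sketch, not part of Pig or HBase: the function name jars_containing is made up, and the directory /usr/local/hbase/lib is taken from the REGISTER lines in the question.

```python
import zipfile
from pathlib import Path


def jars_containing(lib_dir, class_name):
    """Return the jar files directly under lib_dir that contain class_name.

    class_name may use dots or slashes as package separators.
    """
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    matches.append(jar)
        except zipfile.BadZipFile:
            # Skip files that are not valid archives.
            continue
    return matches


# Directory from the question's REGISTER lines; adjust to your installation.
for jar in jars_containing(
    "/usr/local/hbase/lib",
    "org.apache.hadoop.hbase.filter.WritableByteArrayComparable",
):
    print(jar)
```

An empty result would suggest that the class is not shipped in that HBase version at all, in which case adding jars will not help and rebuilding Pig against the newer HBase, as described in the other answer, is probably the way to go.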

play2-elastic does not work when ElasticSearch is installed on EC2 server

When I try to connect to Elasticsearch (elasticsearch-0.90.3) installed on EC2 from a non-local machine using the play2-elastic plugin, it throws the following exception (the plugin works fine when connecting locally):
[error] application - ElasticSearch : No ElasticSearch node is available. Please check that your configuration is correct, that you ES server is up and reachable from the network. Index has not been created and prepared.
org.elasticsearch.client.transport.NoNodeAvailableException: No node available
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:205) ~[elasticsearch-0.90.3.jar:na]
at org.elasticsearch.client.transport.support.InternalTransportIndicesAdminClient.execute(InternalTransportIndicesAdminClient.java:85) ~[elasticsearch-0.90.3.jar:na]
at org.elasticsearch.client.support.AbstractIndicesAdminClient.exists(AbstractIndicesAdminClient.java:147) ~[elasticsearch-0.90.3.jar:na]
at org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequestBuilder.doExecute(IndicesExistsRequestBuilder.java:43) ~[elasticsearch-0.90.3.jar:na]
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85) ~[elasticsearch-0.90.3.jar:na]
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59) ~[elasticsearch-0.90.3.jar:na]
I have used different methods to check that the Elasticsearch server is up and running, for example:
curl -XGET '184.72.55.204:9300/_analyze?analyzer=standard' -d 'this is a test'
curl: (52) Empty reply from server
telnet 184.72.55.204 9300
Trying 184.72.55.204...
Connected to ec2-184-72-55-204.us-west-1.compute.amazonaws.com.
Escape character is '^]'.
In some Google Groups threads I also saw other people with a similar problem who seem to have fixed it by turning sniffing off, so I have this in my application.conf:
elasticsearch.client="184.72.55.204:9300"
elasticsearch.sniff=false # I ADDED THIS BUT DID NOT HELP
elasticsearch.index.name="phonotags"
elasticsearch.index.settings="{ analysis: { analyzer: { my_analyzer: { type: \"custom\", tokenizer: \"standard\" } } } }"
elasticsearch.index.clazzs="indexing.*"
elasticsearch.index.show_request=true
My build.scala file contains these lines:
"com.clever-age" % "play2-elasticsearch" % "0.7-SNAPSHOT"
resolvers += Resolver.url("play-plugin-releases", new URL("http://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/"))(Resolver.ivyStylePatterns),
resolvers += Resolver.url("play-plugin-snapshots", new URL("http://repo.scala-sbt.org/scalasbt/sbt-plugin-snapshots/"))(Resolver.ivyStylePatterns)
I appreciate your help. Thanks.

It seems your node is not available. Can you check whether indexing a document over the HTTP port (9200) works?
curl -XPUT '184.72.55.204:9200/twitter/tweet/1' -d '{ "user": "kimchy", "post_date" : "2011-08-18T16:20:00", "message" : "trying out Elastic Search" }'

Error while connecting Elastic Map Reduce ruby client

I am following the steps in the AWS documentation to use an interactive Hive session over SSH.
I used the following resources:
https://github.com/ucbtwitter/getting-started/wiki/Using-Elastic-Map-Reduce-via-Command-Line
http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/SignUp.html
I was getting this error initially "Error: Missing key access-id" and then I fixed my JSON file. The JSON file is in the same format as mentioned in the above links.
When I run this command:
./elastic-mapreduce
I get the following error:
Error: Unable to parse credentials.json: can't convert String into Integer.
I checked the values required in JSON at AWS as well.
Does anyone have an idea why I am getting this error?

The region value in the credentials.json must be of int type.
{......
......
"region": 1
}
