Produce Avro data from Windows with Docker

I'm following the "How to transform a stream of events" tutorial.
Everything works fine until the step under the heading "Produce events to the input topic":
docker exec -i schema-registry /usr/bin/kafka-avro-console-producer --topic raw-movies --bootstrap-server broker:9092 --property value.schema="$(< src/main/avro/input_movie_event.avsc)"
I'm getting:
<: The term '<' is not recognized as the name of a cmdlet, function,
script file, or operable program. Check the spelling of the name, or
if a path was included, verify that the path is correct and try again.
What is the proper way to pass an Avro schema file to --property value.schema?
All Confluent Kafka servers are running fine.
Schema registry is empty at this point:
PS C:\Users\Joe> curl -X GET http://localhost:8081/subjects
[]
How can I register an Avro schema file in Schema Registry manually from the CLI? I can't find an option for that in the Schema Registry API.
My thinking was that if I register the schema manually, I could then reference it from the producer.
EDIT 1
Tried storing the Avro file path in a PowerShell variable:
$avroPath = 'D:\ConfluentKafkaDocker\kafkaStreamsDemoProject\src\main\avro\input_movie_event.avsc'
And then executing:
docker exec -i schema-registry /usr/bin/kafka-avro-console-producer --topic raw-movies --bootstrap-server broker:9092 --property value.schema=$avroPath
But that didn't work.
EDIT 2
Managed to get past that error with:
$avroPath = 'D:\ConfluentKafkaDocker\kafkaStreamsDemoProject\src\main\avro\input_movie_event.avsc'
docker exec -i schema-registry /usr/bin/kafka-avro-console-producer --topic raw-movies --bootstrap-server broker:9092 --property value.schema.file=$avroPath
But now I'm getting:
org.apache.kafka.common.config.ConfigException: Error reading schema from D:\ConfluentKafkaDocker\kafkaStreamsDemoProject\src\main\avro\input_movie_event.avsc
at io.confluent.kafka.formatter.SchemaMessageReader.getSchemaString(SchemaMessageReader.java:260)
at io.confluent.kafka.formatter.SchemaMessageReader.getSchema(SchemaMessageReader.java:222)
at io.confluent.kafka.formatter.SchemaMessageReader.init(SchemaMessageReader.java:153)
at kafka.tools.ConsoleProducer$.main(ConsoleProducer.scala:43)
at kafka.tools.ConsoleProducer.main(ConsoleProducer.scala)
input_movie_event.avsc:
{
  "namespace": "io.confluent.developer.avro",
  "type": "record",
  "name": "RawMovie",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "title", "type": "string"},
    {"name": "genre", "type": "string"}
  ]
}
It's copy-pasted from the example, so I see no reason why it would be incorrectly formatted.
EDIT 3
Tried with forward slashes, since PowerShell now accepts them:
value.schema.file=src/main/avro/input_movie_event.avsc
and then with backslashes:
value.schema.file=src\main\avro\input_movie_event.avsc
I'm getting the same error as in Edit 2, so it looks like the value.schema.file flag is not working properly.
EDIT 4
Tried with value.schema="$(cat src/main/avro/input_movie_event.avsc)" as suggested in a linked answer.
The error I'm getting now is:
[2022-04-05 10:17:24,135] ERROR Could not parse Avro schema
(io.confluent.kafka.schemaregistry.avro.AvroSchemaProvider)
org.apache.avro.SchemaParseException:
com.fasterxml.jackson.core.JsonParseException: Unexpected character
('n' (code 110)): was expecting double-quote to start field name at
[Source: (String)"{ namespace: io.confluent.developer.avro, type:
record, name: RawMovie, fields: [ {name: id, type: long},
{name: title, type: string}, {name: genre, type: string} ] }";
line: 1, column: 6]
at org.apache.avro.Schema$Parser.parse(Schema.java:1427)
at org.apache.avro.Schema$Parser.parse(Schema.java:1413)
at io.confluent.kafka.schemaregistry.avro.AvroSchema.<init>(AvroSchema.java:70)
at io.confluent.kafka.schemaregistry.avro.AvroSchemaProvider.parseSchema(AvroSchemaProvider.java:54)
at io.confluent.kafka.schemaregistry.SchemaProvider.parseSchema(SchemaProvider.java:63)
at io.confluent.kafka.formatter.SchemaMessageReader.parseSchema(SchemaMessageReader.java:212)
at io.confluent.kafka.formatter.SchemaMessageReader.getSchema(SchemaMessageReader.java:224)
at io.confluent.kafka.formatter.SchemaMessageReader.init(SchemaMessageReader.java:153)
at kafka.tools.ConsoleProducer$.main(ConsoleProducer.scala:43)
at kafka.tools.ConsoleProducer.main(ConsoleProducer.scala)
The error says it was expecting a double-quote to start a field name, and the schema it prints reads name: id, whereas the file contains:
"fields": [
{"name": "id", "type": "long"},
{"name": "title", "type": "string"},
{"name": "genre", "type": "string"}
]
Why is it parsed as if there were no double quotes, when they are actually present in the file?
EDIT 5
Tried with value.schema="$(type src/main/avro/input_movie_event.avsc)"
since type is the Windows equivalent of cat - got the same error as in Edit 4.
Tried with Get-Content as suggested here - same error.

How can I register Avro file in Schema manually from CLI?
You would not use a Producer, or Docker.
You can use Postman (or the PowerShell equivalent of curl) to send a POST request to the /subjects endpoint, as the Schema Registry API documentation describes for registering schemas.
After that, using value.schema.id, as linked, will work.
Or, if you don't want to install anything else, I'd stick with value.schema.file. For that to work, you must start the container with this file (or the whole src\main\avro folder) mounted as a Docker volume, and then reference it by its path inside the container, not by a Windows path, when you use it in a docker exec command. My linked answer about the cat usage assumes your files are on the same filesystem.
Otherwise, the exec command is interpreted by PowerShell first, so input redirection won't work for value.schema. type would be the correct CMD command, but the $() syntax might not be, as that's for UNIX shells.
Related - PowerShell: Store Entire Text File Contents in Variable
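As a rough sketch of the manual registration route described above, the following Python builds the POST request that the Schema Registry API expects for registering a schema. It avoids shell quoting entirely, which is where the PowerShell attempts lost the double quotes. The subject name raw-movies-value is an assumption (it presumes the default topic-name subject strategy), the registry URL is taken from the curl call in the question, and build_registration_request is a hypothetical helper name.

```python
import json
import urllib.request

# Schema text copied from input_movie_event.avsc in the question.
RAW_MOVIE_SCHEMA = """
{
  "namespace": "io.confluent.developer.avro",
  "type": "record",
  "name": "RawMovie",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "title", "type": "string"},
    {"name": "genre", "type": "string"}
  ]
}
"""


def build_registration_request(schema_text, subject, registry_url="http://localhost:8081"):
    """Build (but do not send) the POST request that registers a schema."""
    # Parsing first catches malformed JSON locally; re-dumping puts it on one line.
    one_line_schema = json.dumps(json.loads(schema_text))
    # The registry expects the schema itself as an escaped JSON *string*.
    body = json.dumps({"schema": one_line_schema}).encode("utf-8")
    return urllib.request.Request(
        url=f"{registry_url}/subjects/{subject}/versions",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


if __name__ == "__main__":
    # "raw-movies-value" is an assumed subject name for the raw-movies topic.
    request = build_registration_request(RAW_MOVIE_SCHEMA, "raw-movies-value")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually register it (needs a reachable Schema Registry):
    # print(urllib.request.urlopen(request).read().decode("utf-8"))
```

The response to a successful registration contains the schema id, which is the value to pass as value.schema.id.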

Related

Get failed service name inside Consul watch handler

I'm using Consul to monitor the health status of services. I use the consul watch command to fire a handler when a service fails. Currently I'm using this command:
consul watch -type=checks -state=passing /home/consul/health.sh
This works, but inside health.sh I'd like to know the name of the failed service so I can send a proper alert message containing it. How can I get the failed service name there?
Your script can get all the required information by reading from stdin. The information is sent as JSON. You can easily examine those events by adding cat - | jq . to your handler.
The check information output by consul watch -type=checks contains a ServiceName field that holds the name of the service the check is associated with.
[
{
"Node": "foobar",
"CheckID": "service:redis",
"Name": "Service 'redis' check",
"Status": "passing",
"Notes": "",
"Output": "",
"ServiceID": "redis",
"ServiceName": "redis"
}
]
(See https://www.consul.io/docs/dynamic-app-config/watches#checks for official docs.)
Checks associated with services should have values in both the ServiceID and ServiceName fields. These fields will be empty for node level checks.
The following command watches changes in health checks, and outputs the name of a service when its check transitions to a state other than "passing" (i.e., warning or critical).
$ consul watch -type=checks -state=passing "jq --raw-output '.[] | select(.ServiceName!=\"\" and .Status!=\"passing\") | .ServiceName'"
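If the handler is easier to maintain as a script than as a jq one-liner, the same filter can be sketched in Python. This is only an illustration: it assumes the handler receives the JSON array shown above on stdin, and failing_services is a hypothetical helper name.

```python
import json
import sys


def failing_services(checks):
    """Return the sorted names of services that have a non-passing check."""
    return sorted(
        {
            check["ServiceName"]
            for check in checks
            # Node-level checks have an empty ServiceName, so skip them.
            if check.get("ServiceName") and check.get("Status") != "passing"
        }
    )


def main():
    # Nothing to do when run interactively without piped input.
    if sys.stdin.isatty():
        return
    raw = sys.stdin.read()
    if not raw.strip():
        return
    for name in failing_services(json.loads(raw)):
        print(name)


if __name__ == "__main__":
    main()
```

Saved under a name of your choosing and made executable, it can replace health.sh as the handler; the printed names can then be fed into whatever alerting you use.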

How to bulk load data into a dgraph/standalone:graphql container?

Assume I have a database like the one in the quick start at https://graphql.dgraph.io/docs/quick-start/, i.e.
type Product {
  productID: ID!
  name: String @search(by: [term])
  reviews: [Review] @hasInverse(field: about)
}
type Customer {
  custID: ID!
  name: String @search(by: [hash, regexp])
  reviews: [Review] @hasInverse(field: by)
}
type Review {
  id: ID!
  about: Product! @hasInverse(field: reviews)
  by: Customer! @hasInverse(field: reviews)
  comment: String @search(by: [fulltext])
  rating: Int @search
}
Now I would like to import millions of entries, so I would like to use the bulk loader. My dataset is a big folder full of .json files.
From what I've seen, I should be able to run a command like
dgraph bulk -f folderOfJsonFiles -s goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080
But to run my server, I am using the dgraph/standalone:graphql image, started with docker run -v $(pwd):/dgraph -p 9000:9000 -it dgraph/standalone:graphql
Now, how do I start the bulk import?
1: Should I run the command within the Docker container itself (sharing the volume (folder) containing all my .json files) or install dgraph on my host and run the dgraph bulk command from the host?
2: What should be the format of the .json files?
3: Would the bulk loader support blank nodes (ids which are not _:0x1234)?
[edit]
The bulk loader does not seem to support the GraphQL schema; the schema should be converted to RDF first. To achieve this, I exported the schema and data right after importing the GraphQL schema: curl 'localhost:8080/admin/export?format=json'
Here are a few things to understand:
The bulk loader is not an offline version of the live loader. It is a tool whose purpose is to prepare the data for the Dgraph Alpha server(s).
The bulk loader seems to be able to load only triples.
The bulk loader can load a schema and data files, but this is not the GraphQL schema; the GraphQL schema must be loaded separately afterwards.
So to answer the question:
Start the Dgraph GraphQL server using docker run -v $(pwd)/dgraph:/dgraph -p 8000:8000 -p 9000:9000 -p 8080:8080 -p 9080:9080 -p 5080:5080 -it dgraph/standalone:graphql. For your information, this image launches the /tmp/run.sh script, which itself runs dgraph-ratel & dgraph zero & dgraph alpha --lru_mb $lru_mb & dgraph graphql (where lru_mb is the memory you give to dgraph alpha). Keep the container's id for later; you can find it with docker ps if you lose it.
Unless you have more than 5 million entries (or no time), try using the live loader. If you have trouble with the live loader, such as it becoming very slow after a few hundred thousand entries (300k in my case), this is very likely because your Alpha does not have sufficient memory. In my case, I had to tune Docker to provide 16 GB of memory to the engine; the script assigns a third of the host memory to the $lru_mb variable.
Once you have imported your full data set using the live loader, you can export the data using docker exec -it yourDockerContainerId curl localhost:8080/admin/export?format=json. The export generates 2 files, for instance g01.json.gz and g01.schema.gz, which correspond to your entries and their schema (which is not the GraphQL schema).
To import those 2 files, g01.json.gz and g01.schema.gz, back into your Dgraph GraphQL instance, you need to convert them to the group's "p" directory output. As far as I understand, the "p" directory holds all the data for the Dgraph Alpha. If you delete it, you lose your data; if you replace it with another set, you replace/restore the data with the set you just copied. The bulk loader is not an instance of Dgraph; it is only the tool that generates those "p" directory outputs. I have been successful running it within the container. Just run docker exec -it yourDockerContainerId dgraph bulk -f export/pathTo/g01.json.gz -s export/pathTo/g01.schema.gz --map_shards=1 --reduce_shards=1 --http localhost:8001 --zero=localhost:5080. To be honest, I do not understand the purpose of the --http localhost:8001 argument in this command. If the bulk loader ran successfully, it created an out/0/p folder containing the data you can use in your Dgraph Alpha. Stop your Docker container with docker stop yourDockerContainerId, then replace your current Dgraph Alpha's p folder with the one generated by the bulk loader. (Re)start your Docker container and you should have your imported data. (Perhaps trash the w and zw folders as well; I have no clue what they are used for.)
The data is imported, but you will get a warning saying something like there is no GraphQL schema. Okay, let's import our schema (assuming you have it at path dgraph/schemas/schema.graphql): schema=$(cat dgraph/schemas/schema.graphql | tr '\n' ' '); jq -n --arg schema "$schema" '{ query: "mutation addSchema($sch: String!) { addSchema(input: { schema: $sch }) { schema { schema } } }", variables: { sch: $schema }}' | curl -X POST -H "Content-Type: application/json" http://localhost:9000/admin -d @- This might take a few minutes, as Dgraph will likely have to index your data according to your GraphQL schema's indexing rules (typically related to the @search directive).
You're done…
Now, I am still not completely answering the question, because the data we are importing back is the data we just exported (and originally imported using the live loader). So unfortunately, the bulk loader cannot import nice data like the live loader can; you have to feed it triples. Therefore you have to prepare the data to load with the bulk loader in that format. To help you with this task, I suggest you:
Run the Dgraph GraphQL server: docker run -v $(pwd)/dgraph:/dgraph -p 8000:8000 -p 9000:9000 -p 8080:8080 -p 9080:9080 -p 5080:5080 -it dgraph/standalone:graphql
Import a GraphQL schema (assuming the schema is at path dgraph/schemas/schema.graphql): schema=$(cat dgraph/schemas/schema.graphql | tr '\n' ' '); jq -n --arg schema "$schema" '{ query: "mutation addSchema($sch: String!) { addSchema(input: { schema: $sch }) { schema { schema } } }", variables: { sch: $schema }}' | curl -X POST -H "Content-Type: application/json" http://localhost:9000/admin -d @-
Create one or two basic / template entries using a GraphQL client. You can install the Altair Chrome extension, connect to http://localhost:9000/graphql, then add some data, something like:
mutation {
  addCustomer(input: {name: "Toto"}) {
    name
  }
}
You can also use a file and the live loader.
Then export your small template data: docker exec -it yourDockerContainerId curl localhost:8080/admin/export?format=json
Open g01.json.gz and you will find an example of the data the bulk loader expects to be fed.
What about blank ids? I am not sure, but as the bulk loader does a two-level mapping on ids, I imagine you can provide your own ids and they will be converted to Dgraph ids later.
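The jq step used above to wrap the GraphQL schema in an addSchema mutation is mostly a JSON-escaping exercise, so it can also be sketched in Python. This is an illustrative equivalent rather than anything Dgraph ships: build_add_schema_payload is a hypothetical helper, and the mutation text and the /admin endpoint are the ones from the commands above.

```python
import json

# Mutation text taken from the jq command in the answer.
ADD_SCHEMA_MUTATION = (
    "mutation addSchema($sch: String!) "
    "{ addSchema(input: { schema: $sch }) { schema { schema } } }"
)


def build_add_schema_payload(schema_text):
    """Return the JSON body to POST to the /admin endpoint."""
    # Mirrors `tr '\n' ' '`: put the schema on a single line.
    flattened = schema_text.replace("\n", " ")
    # json.dumps takes care of escaping quotes inside the schema.
    return json.dumps(
        {"query": ADD_SCHEMA_MUTATION, "variables": {"sch": flattened}}
    )


if __name__ == "__main__":
    example_schema = 'type Customer {\ncustID: ID!\nname: String @search(by: [hash, regexp])\n}'
    print(build_add_schema_payload(example_schema))
```

The printed body is what the pipeline above pipes into curl with -d @-; in practice you would read dgraph/schemas/schema.graphql instead of the inline example string.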

GUI or CLI to create parquet file

I want to provide the people I work with a tool to create parquet files to be used in unit tests of modules that read and process such files.
I use ParquetViewer to view the content of parquet files, but I would like a tool to create (sample) parquet files. Is there a tool for creating parquet files with a GUI, or otherwise some practical CLI?
Note: I would prefer a cross-platform solution, but failing that I am looking for a Windows/MinGW solution that I can use at work, where I cannot choose the OS :\
parquet-cli, written in Java, can convert from CSV to parquet.
(This is a sample on Windows)
test.csv is below:
emp_id,dept_id,name,created_at,updated_at
1,1,"test1","2019-02-17 10:00:00","2019-02-17 12:00:00"
2,2,"test2","2019-02-17 10:00:00","2019-02-17 12:00:00"
It requires winutils on Windows. Download it and set the HADOOP_HOME environment variable.
$ set HADOOP_HOME=D:\development\hadoop
Clone parquet-mr, build everything, and run the 'convert-csv' command of parquet-cli.
$ cd parquet-cli
$ java -cp target/classes;target/dependency/* org.apache.parquet.cli.Main convert-csv C:\Users\foo\Downloads\test.csv -o C:\Users\foo\Downloads\test-csv.parquet
The 'cat' command shows the content of that parquet file.
$ java -cp target/classes;target/dependency/* org.apache.parquet.cli.Main cat C:\Users\foo\Downloads\test-csv.parquet
{"emp_id": 1, "dept_id": 1, "name": "test1", "created_at": "2019-02-17 10:00:00", "updated_at": "2019-02-17 12:00:00"}
{"emp_id": 2, "dept_id": 2, "name": "test2", "created_at": "2019-02-17 10:00:00", "updated_at": "2019-02-17 12:00:00"}
Copying from this answer: https://stackoverflow.com/a/74010417/220997
You can use DBeaver to create parquet files. It is cross-platform. Create an in-memory DuckDB database and then write a query. Some examples are here: https://duckdb.org/docs/data/parquet
It still requires some technical knowledge, but it's not too bad.
Example code to output one record with one column:
COPY (SELECT 'test1' as col1) TO 'C:\Users\name\Desktop\result-snappy.parquet' (FORMAT 'parquet');
You can use the same process to view the file:
SELECT * FROM read_parquet('C:\Users\name\Desktop\result-snappy.parquet');

Parse VMware REST API response

I'm trying to parse a json response from a REST API call. My awk is not strong. This is a bash shell script, and I use curl to get the response and write it to a file. My problem is solely trying to cut the response up into useful parts.
The response is all run together on one line and looks like this:
{
"value": {
"summary": "Patch for VMware vCenter Server Appliance 6.5.0",
"install_time": "2017-03-22T22:43:25 UTC",
"product": "VMware vCenter Server Appliance",
"build": "5178943",
"releasedate": "March 14, 2017",
"type": "vCenter Server with an external Platform Services Controller",
"version": "6.5.0.5300"
}
}
I'm interested in simply writing the type, version, and product strings into a log file. Ideally on 3 lines, but I really don't care; I simply need to be able to identify the build etc at the time this backup script ran, so if I need to rebuild & restore I can make sure I have a compatible build.
Your REST API returns JSON, so it's best handled with a JSON parser like jq:
curl -s '/rest/endpoint' | jq -r '.value | .type,.version,.product' > config.txt
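If jq is not available on the machine running the backup script, the same extraction can be sketched with Python's standard library. This is a minimal illustration that assumes the response has the shape shown in the question; summarize is a hypothetical helper name.

```python
import json


def summarize(response_text):
    """Return the type, version and product strings, in that order."""
    value = json.loads(response_text)["value"]
    return [value[key] for key in ("type", "version", "product")]


if __name__ == "__main__":
    # Sample response copied from the question, run together on one line.
    sample = (
        '{"value": {"summary": "Patch for VMware vCenter Server Appliance 6.5.0", '
        '"install_time": "2017-03-22T22:43:25 UTC", '
        '"product": "VMware vCenter Server Appliance", "build": "5178943", '
        '"releasedate": "March 14, 2017", '
        '"type": "vCenter Server with an external Platform Services Controller", '
        '"version": "6.5.0.5300"}}'
    )
    # One value per line, like the jq -r output.
    print("\n".join(summarize(sample)))
```

Adding "build" to the tuple of keys would also record the build number mentioned in the question.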

How can multiple files be specified with "-files" in the CLI of Amazon for EMR?

I am trying to start an Amazon EMR cluster via the AWS CLI, but I am a little confused about how I should specify multiple files. My current call is as follows:
aws emr create-cluster --steps Type=STREAMING,Name='Intra country development',ActionOnFailure=CONTINUE,Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra] --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate --log-uri s3://betaestimationtest/logs
However, Hadoop now complains that it cannot find the reducer file:
Caused by: java.io.IOException: Cannot run program "reducer.py": error=2, No such file or directory
What am I doing wrong? The file does exist in the folder I specify.
For passing multiple files in a streaming step, you need to use file:// to pass the steps as a JSON file.
The AWS CLI shorthand syntax uses a comma as the delimiter to separate a list of args. So when we try to pass in parameters like "-files","s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py", the shorthand syntax parser treats the mapper.py and reducer.py files as two separate parameters.
The workaround is to use the JSON format. Please see the example below.
aws emr create-cluster --steps file://./mysteps.json --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate --log-uri s3://betaestimationtest/logs
mysteps.json looks like:
[
  {
    "Name": "Intra country development",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-files",
      "s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py",
      "-mapper",
      "mapper.py",
      "-reducer",
      "reducer.py",
      "-input",
      "s3://betaestimationtest/output_0_inter",
      "-output",
      "s3://betaestimationtest/output_1_intra"
    ]
  }
]
You can also find examples here: https://github.com/aws/aws-cli/blob/develop/awscli/examples/emr/create-cluster-examples.rst. See example 13.
Hope it helps!
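Writing mysteps.json by hand is easy to get subtly wrong (a stray space inside a URI, for example), so it may help to generate it. The sketch below is only an illustration: streaming_step is a hypothetical helper, and the bucket paths are the ones from the question. The key point it demonstrates is that all files go into a single comma-joined -files argument.

```python
import json


def streaming_step(name, files, mapper, reducer, input_uri, output_uri):
    """Build one STREAMING step in the shape that --steps file://... expects."""
    return {
        "Name": name,
        "Type": "STREAMING",
        "ActionOnFailure": "CONTINUE",
        "Args": [
            # A single -files argument whose value is a comma-separated list.
            "-files", ",".join(files),
            "-mapper", mapper,
            "-reducer", reducer,
            "-input", input_uri,
            "-output", output_uri,
        ],
    }


if __name__ == "__main__":
    step = streaming_step(
        name="Intra country development",
        files=[
            "s3://betaestimationtest/mapper.py",
            "s3://betaestimationtest/reducer.py",
        ],
        mapper="mapper.py",
        reducer="reducer.py",
        input_uri="s3://betaestimationtest/output_0_inter",
        output_uri="s3://betaestimationtest/output_1_intra",
    )
    # Redirect this output to mysteps.json and pass it with --steps file://./mysteps.json
    print(json.dumps([step], indent=2))
```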
You are specifying -files twice; you only need to specify it once. I forget whether the CLI needs the separator to be a space or a comma for multiple values, but you can try both.
You should replace:
Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
with:
Args=[-files,s3://betaestimationtest/mapper.py s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
or if that fails, with:
Args=[-files,s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
Add an escape for the comma separating the files:
Args=[-files,s3://betaestimationtest/mapper.py\\,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]

Resources