What are the folders generated by RocksDB - apache-kafka-streams

In my Kubernetes container I have configured RocksDB for my Kafka Streams application:
containers:
  ...
  volumeMounts:
  - mountPath: /tmp/state
    name: state-volume
volumes:
- name: state-volume
  emptyDir:
    medium: Memory
After the deployment, if I go to the location /tmp/state/my-service, I see a dozen folders, e.g. 0_105, 0_107, 0_9, etc. If I go into one of these folders, e.g. /tmp/state/my-service/0_105/rocksdb/state, I see the actual .sst files.
Inside each of these 0_* folders there are multiple .sst files, which I suppose is how RocksDB splits the keys across files. But what do those 0_* folders do, and how are they split the way they are?

Kafka Streams creates "tasks" based on your program structure and the number of input topic partitions. The directories you see are task directories that allow each task to store its state locally in an isolated manner.
Each task has a task id Y_Z that is used as the task directory name. The first number is the sub-topology ID and the second number is basically the input partition number.
You can see what sub-topologies are generated by describing your topology via Topology#describe() (you can just print the TopologyDescription to stdout if you want).
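For reference, here is a minimal sketch of doing that (the topic names and class name are placeholders, not taken from your application):
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyDescription;

public class DescribeTopology {
    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();
        // Stand-in for your real program: replace with your actual processing logic.
        builder.stream("input-topic").to("output-topic");

        final Topology topology = builder.build();
        final TopologyDescription description = topology.describe();

        // Each "Sub-topology: N" printed here corresponds to the first number of the
        // task directory names (N_<partition>) you see under the state directory.
        System.out.println(description);
    }
}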

Related

How to export the output of multiple gcloud queries into adjacent sheets within one CSV file, using Bash?

I have the following 3 gcloud queries:
Query 1 - To enumerate users of a project:
gcloud projects get-iam-policy MyProject --format="csv(bindings.members)" >> output1.csv
Query 2 - To enumerate users of a folder:
gcloud resource-manager folders get-iam-policy MyFolder --format="csv(bindings.members)" >> output2.csv
Query 3 - To enumerate users of the organization:
gcloud organizations get-iam-policy MyOrg --format="csv(bindings.members)" >> output3.csv
My goal is to run all 3 queries together and export the output in multiple adjacent sheets within one CSV file, instead of 3 separate CSV files. Is that possible?
Please advise. Thanks.
It is not possible.
Comma-delimited files (CSVs) do not support multiple 'sheets' or tables within a single file.
You must create a file per table.
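If it helps, here is a small sketch of that approach: run all three queries from one script, but keep one CSV per table (the output file names below are placeholders):
#!/usr/bin/env bash
# One CSV per "table", since a plain CSV file cannot hold multiple sheets.
gcloud projects get-iam-policy MyProject --format="csv(bindings.members)" > project_members.csv
gcloud resource-manager folders get-iam-policy MyFolder --format="csv(bindings.members)" > folder_members.csv
gcloud organizations get-iam-policy MyOrg --format="csv(bindings.members)" > org_members.csv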

Dependency and condition order in Azure DevOps Pipeline

In an Azure Pipelines YAML file, when defining multiple jobs in a single stage, one can specify dependencies between them. One can also specify the conditions under which each job runs.
Code #1
jobs:
- job: A
  steps:
  - script: echo hello
- job: B
  dependsOn: A
  condition: and(succeeded(), eq(variables['build.sourceBranch'], 'refs/heads/master'))
  steps:
  - script: echo this only runs for master
Code #2
jobs:
- job: A
  steps:
  - script: "echo ##vso[task.setvariable variable=skipsubsequent;isOutput=true]false"
    name: printvar
- job: B
  condition: and(succeeded(), ne(dependencies.A.outputs['printvar.skipsubsequent'], 'true'))
  dependsOn: A
  steps:
  - script: echo hello from B
Question:
Code #1 and #2 above have different orders for dependsOn and condition. Does the order matter? If so, how? (What's the difference between the different orders?)
Let's discuss #1 and #2 separately.
Code #1:
There is no data connection between job A and job B (data connection here means variable sharing and the like), so for #1 the order does not matter. You can even omit dependsOn entirely if you have no special requirements on the execution order between job A and job B.
But there is one key thing to pay attention to: when you do not specify dependsOn, the actual running order can change randomly. Most of the time the jobs will run in the order job A, job B; occasionally they will run as job B, job A.
Code #2:
Here dependsOn must be specified, because job B uses an output variable that is created in job A. Since the same variable name can exist in different jobs, you must specify dependsOn so that the system knows job B should look for the variable skipsubsequent in job A and not elsewhere. Only when this keyword is specified are the variables generated in job A exposed and made available to the following jobs.
In a nutshell: once there is any data connection between jobs, e.g. passing a variable, you must specify dependsOn so that the jobs are linked to each other.

Writing Spark dataframe as parquet to S3 without creating a _temporary folder

Using pyspark I'm reading a dataframe from parquet files on Amazon S3 like
dataS3 = sql.read.parquet("s3a://" + s3_bucket_in)
This works without problems. But then I try to write the data
dataS3.write.parquet("s3a://" + s3_bucket_out)
I do get the following exception
py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.IllegalArgumentException: java.net.URISyntaxException:
Relative path in absolute URI: s3a://<s3_bucket_out>_temporary
It seems to me that Spark is trying to create a _temporary folder first, before writing into the given bucket. Can this be prevented somehow, so that Spark writes directly to the given output bucket?
You can't eliminate the _temporary directory, as it's used to keep the intermediate work of a query hidden until it's complete.
But that's OK, as this isn't the problem. The problem is that the output committer gets a bit confused trying to write to the root directory (it can't delete it, you see).
You need to write to a subdirectory under a bucket, with a full prefix, e.g. s3a://mybucket/work/out.
I should add that trying to commit data to S3A is not reliable, precisely because of the way it mimics rename() with something like ls -rlf src | xargs -p8 -I% "cp % dst/% && rm %". Because ls has delayed consistency on S3, it can miss newly created files and so not copy them.
See: Improving Apache Spark for the details.
Right now, you can only reliably commit to s3a by writing to HDFS and then copying. EMR's S3 connector works around this by using DynamoDB to offer a consistent listing.
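As a rough pyspark sketch of that suggestion (the bucket names and the work/out prefix below are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-parquet-s3a").getOrCreate()

s3_bucket_in = "my-input-bucket"    # placeholder
s3_bucket_out = "my-output-bucket"  # placeholder

dataS3 = spark.read.parquet("s3a://" + s3_bucket_in)

# Writing to s3a://<bucket>/work/out rather than s3a://<bucket> keeps the
# committer's _temporary directory out of the bucket root.
dataS3.write.parquet("s3a://" + s3_bucket_out + "/work/out")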
I had the same issue when writing to the root of an S3 bucket:
df.save("s3://bucketname")
I resolved it by adding a / after the bucket name:
df.save("s3://bucketname/")

Run Ansible playbook on UNIQUE user/host combination

I've been trying to implement Ansible in our team to manage different kinds of application concerns, such as configuration files for products and applications, the distribution of maintenance scripts, ...
We don't like to work with "hostnames" in our team because we have 300+ of them with meaningless names. Therefore, I started out creating aliases for them in the Ansible hosts file like:
[bpm-i]
bpm-app1-i1 ansible_user=bpmadmin ansible_host=el1001.bc
bpm-app1-i2 ansible_user=bpmadmin ansible_host=el1003.bc
[bpm-u]
bpm-app1-u1 ansible_user=bpmadmin ansible_host=el2001.bc
bpm-app1-u2 ansible_user=bpmadmin ansible_host=el2003.bc
[bpm-all:children]
bpm-i
bpm-u
Meaning we have a BPM application named "app1" and it's deployed on two hosts in integration testing and on two hosts in user acceptance testing. So far so good. Now I can run an Ansible playbook to (for example) set up SSH access (authorized_keys) for team members or push a maintenance script. I can run those PBs on each host separately, on all hosts in ITT or UAT, or even everywhere.
But, typically, we'll have to install the same application app1 again on an existing host but with a different purpose, say a "training" environment. My reflex would be to do this:
[bpm-i]
bpm-app1-i1 ansible_user=bpmadmin ansible_host=el1001.bc
bpm-app1-i2 ansible_user=bpmadmin ansible_host=el1003.bc
[bpm-u]
bpm-app1-u1 ansible_user=bpmadmin ansible_host=el2001.bc
bpm-app1-u2 ansible_user=bpmadmin ansible_host=el2003.bc
[bpm-t]
bpm-app1-t1 ansible_user=bpmadmin ansible_host=el2001.bc
bpm-app1-t2 ansible_user=bpmadmin ansible_host=el2003.bc
[bpm-all:children]
bpm-i
bpm-u
bpm-t
But ... running PBs becomes a mess now and causes errors. Logically I have two alias names that reach the same user/host combination: bpm-app1-u1 and bpm-app1-t1. I don't mind, that's perfectly logical, but if I were to test a new maintenance script, I would first push it to bpm-app1-i1 for testing and, when OK, I would probably run the PB against bpm-all. But because of the non-unique user/host combinations for some aliases, the PB would run multiple times on the same user/host. Depending on the actions in the PB this may coincidentally work, but it may also fail horribly.
Is there no way to tell Ansible "Run on ALL - UNIQUE user/host combinations" ?
Since most tasks change something on the remote host, you could use Conditionals to check for that change on the host before running.
For example, if your playbook has a task to run a script that creates a file on the remote host, you can add a when clause to "skip the task if file exists" and check for the existence of that file with a stat task before that one.
- name: Check whether script has run in a previous instance by looking for the file
  stat: path=/path/to/something
  register: something
- name: Run script when the file above does not exist
  command: bash myscript.sh
  when: not something.stat.exists
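Note that the stat module returns its results under the registered variable's stat key, which is why the condition checks something.stat.exists rather than something.exists.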

How to make Hadoop Distcp copy custom list of folders?

I'm looking for an efficient way to sync a list of directories from one Hadoop filesystem to another with the same directory structure.
For example, let's say HDFS1 is the official source where data is created, and once a week we need to copy the newly created data under all data-2 directories to HDFS2:
**HDFS1**
hdfs://namenode1:port/repo/area-1/data-1
hdfs://namenode1:port/repo/area-1/data-2
hdfs://namenode1:port/repo/area-1/data-3
hdfs://namenode1:port/repo/area-2/data-1
hdfs://namenode1:port/repo/area-2/data-2
hdfs://namenode1:port/repo/area-3/data-1
**HDFS2** (subset of HDFS1 - only data-2)
hdfs://namenode2:port/repo/area-1/data-2
hdfs://namenode2:port/repo/area-2/data-2
In this case we have 2 directories to sync:
/repo/area-1/data-2
/repo/area-2/data-2
This can be done by:
hadoop distcp hdfs://namenode1:port/repo/area-1/data-2 hdfs://namenode2:port/repo/area-1
hadoop distcp hdfs://namenode1:port/repo/area-2/data-2 hdfs://namenode2:port/repo/area-2
This will run 2 Hadoop jobs, and if the number of directories is big, let's say 500 different non-overlapping directories under hdfs://namenode1:port/, this will create 500 Hadoop jobs, which is obvious overkill.
Is there a way to inject a custom directory list into distcp?
How can I make distcp create one job copying all paths in a custom list of directories?
Not sure if this answers the problem, but I noticed you haven't used the -update option. -update will only copy over the differences in the blocks between the two file systems...
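For example, a sketch based on that suggestion, using the paths from the question (with -update, the contents of the source directory are copied into the target directory, so the target is the matching data-2 path):
hadoop distcp -update \
  hdfs://namenode1:port/repo/area-1/data-2 \
  hdfs://namenode2:port/repo/area-1/data-2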
