Run TPCC on local cockroachdb cluster - cockroachdb

I have a local cockroachdb up and running by following instructions from https://www.cockroachlabs.com/docs/stable/start-a-local-cluster.html
I am trying to run the tpcc benchmark following the instructions from https://www.cockroachlabs.com/docs/stable/performance-benchmarking-with-tpc-c.html
It looks like the TPCC binary workload.LATEST assumes the cluster is on Google Cloud, so it produces the following error:
$ ./workload.LATEST fixtures load tpcc --warehouses=1000 "postgres://root@localhost:26257?sslmode=disable"
Error: failed to create google cloud client (You may need to setup the GCS application default credentials: 'gcloud auth application-default login --project=cockroach-shared'): dialing: google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
What can I change to run the benchmark?

If you upgrade to v2.1, workload is a built-in command that you can run against your cluster, and it does not make the Google Cloud assumption: https://www.cockroachlabs.com/docs/stable/cockroach-workload.html

It's not nearly as fast as using the fixtures stored in Google Cloud, but you can load the data into your cluster using normal SQL statements by running something like:
workload init tpcc --warehouses=1000
Note that while I'm not sure exactly how long it will take to load 1000 warehouses in this way locally, I expect it will take quite some time.
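For example, with the v2.1 built-in command and a single insecure local node on the default port, the whole flow looks roughly like this (a sketch; adjust the connection string, warehouse count, and duration for your cluster):
cockroach workload init tpcc --warehouses=1000 'postgresql://root@localhost:26257?sslmode=disable'
cockroach workload run tpcc --warehouses=1000 --duration=10m 'postgresql://root@localhost:26257?sslmode=disable'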

Related

How to run DBT in AWS Lambda?

I have currently dockerized my DBT solution and I launch it in AWS Fargate (triggered from Airflow). However, Fargate requires about 1 minute to start running (image pull + resource provisioning + etc.), which is great for long running executions (hours), but not for short ones (1-5 minutes).
I'm trying to run my docker container in AWS Lambda instead of in AWS Fargate for short executions, but I encountered several problems during this migration.
The one I cannot fix is related to the message below, which appears when running dbt deps --profiles-dir . && dbt run -t my_target --profiles-dir . --select my_model:
Running with dbt=0.21.0
Encountered an error:
[Errno 38] Function not implemented
It says a function is not implemented, but I cannot see anywhere which function that is. Since it appears while installing the dbt packages (redshift and dbt_utils), I tried downloading them and including them in the docker image (setting local paths in packages.yml), but nothing changed. Moreover, dbt writes no logs at this phase (I set the log-path to /tmp in dbt_project.yml so that it has write permissions within the Lambda), so I'm blind.
Digging into this problem, I've found that it can be related to multiprocessing issues within AWS Lambda (my docker image contains python scripts), as stated in https://github.com/dbt-labs/dbt-core/issues/2992. I run dbt from python using the subprocess library.
Since it may be a multiprocessing issue, I have also tried to set "threads": 1 in profiles.yml but it did not solve the problem.
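For reference, the threads setting goes in the target's output block of profiles.yml, roughly like this (the profile, target, and connection values below are placeholders, not from the question):
my_profile:
  target: my_target
  outputs:
    my_target:
      type: redshift
      host: my-cluster.example.us-east-1.redshift.amazonaws.com
      port: 5439
      user: my_user
      password: my_password
      dbname: my_db
      schema: my_schema
      threads: 1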
Has anyone succeeded in deploying dbt in AWS Lambda?
I've recently been trying to do this, and the summary of what I've found is that it seems to be possible, but isn't worth it.
You can pretty easily build a Lambda Layer that includes dbt & the provider you want to use, but you'll also need to patch the multiprocessing behavior and invoke dbt.main from within the Lambda code. Once you've jumped through all those hoops, you're left with a dbt instance that is limited to a relatively small upper bound on memory, capped at a 15-minute maximum runtime, and throttled to a single thread.
This discussion gives a rough example of what's needed to get it running in Lambda: https://github.com/dbt-labs/dbt-core/issues/2992#issuecomment-919288906
All that said, I'd love to put dbt on a Lambda and I hope dbt's multiprocessing will one day support it.
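For what it's worth, a rough sketch of the in-process entry point (not the multiprocessing patch itself, which is in the linked comment); it assumes dbt-core <= 1.4, which still ships dbt.main, a writable profiles dir under /tmp, and placeholder target names:
import dbt.main  # in-process entry point; removed in dbt-core 1.5+ in favor of dbtRunner

def handler(event, context):
    # Hypothetical Lambda handler: call dbt in-process instead of via subprocess.
    # The multiprocessing patch from the issue comment above is still required.
    args = ["run", "--profiles-dir", "/tmp", "--target", "my_target", "--threads", "1"]
    results, success = dbt.main.handle_and_check(args)
    return {"success": success}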

How to run a K6 script locally and send data to remote InfluxDB instance (No Docker)

I'm extremely new to k6 + InfluxDB + Grafana, and I was given a task that involves executing certain k6 scripts locally but saving/passing the data to a remote InfluxDB instance.
As of now I'm having issues, since I'm not sure what configuration I'm missing: every time I run the script pointing at the InfluxDB instance, I get an error.
The command that I'm executing is:
k6 run --out influxdb="https://my_influxdb_url/write" //sampleScript.js
But the original URL that was handed over to me was something like this:
https://my_influxdb_url/write?db=DB_NAME&u=USERNAME&p=PASSWORD
And when I execute the first mentioned script I'm getting the following error:
ERRO[000X] Couldn't write stats error="404 page not found\n" output=InfluxDB1
So I've tried creating K6_INFLUXDB_USERNAME and K6_INFLUXDB_PASSWORD as environment variables but I'm still getting the same error.
I'm not sure if I'm missing some .yaml file, like a datasource, in which I should fill in those 3 values (DB_NAME, USERNAME, PASSWORD)?
Or maybe I'm just doing it all wrong and not calling the execution command properly for this scenario.
Another weird thing I noticed is that the output field shows InfluxDB1 instead of my actual InfluxDB URL, which I guess might be where my issue lies.
Any kind of tip would be greatly appreciated, since all the documentation I've found so far either runs everything in a Docker container with Grafana+InfluxDB or runs it purely locally, which is not my case :(
Thanks a lot in advance as always!!
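For what it's worth, with k6's InfluxDB v1 output the database name is passed as the path of the output URL (k6 calls the /write endpoint itself), and the credentials go in the environment variables mentioned above. A hedged sketch of that shape, reusing the question's placeholders:
export K6_INFLUXDB_USERNAME=USERNAME
export K6_INFLUXDB_PASSWORD=PASSWORD
k6 run --out influxdb=https://my_influxdb_url/DB_NAME sampleScript.js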

Serverless Detect Running Locally

I am running a command like the following.
serverless invoke local --function twilio_incoming_call
When running locally, I plan to detect this in my code and, instead of looking for POST variables, look for a MOCK file I'll be giving it.
However, I don't know how to detect whether I'm running serverless with this local command.
How do you do this?
I looked around on the Serverless website and found lots of info about running locally, but nothing about detecting whether you are running locally.
I found out the answer. process.env.IS_LOCAL will detect if you are running locally. Missed this on their website somehow...
If you're using AWS Lambda, it sets some built-in environment variables. In the absence of those variables, you can conclude that your function is running locally.
https://docs.aws.amazon.com/lambda/latest/dg/lambda-environment-variables.html
const isRunningLocally = !process.env.AWS_EXECUTION_ENV
This method works regardless of the framework you use, whether it's Serverless, Apex Up, AWS SAM, etc.
You can also check what is in process.argv:
process.argv[1] will equal '/usr/local/bin/sls'
process.argv[2] will equal 'invoke'
process.argv[3] will equal 'local'
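Putting the environment-variable checks together, a small sketch (IS_LOCAL is set by serverless invoke local per the answer above; the mock file path and handler name here are just illustrative):
// Treat the invocation as local if the Serverless-provided IS_LOCAL flag is set,
// or if the real Lambda runtime's environment variables are absent.
const isRunningLocally = !!process.env.IS_LOCAL || !process.env.AWS_EXECUTION_ENV;

module.exports.twilio_incoming_call = async (event) => {
  // Hypothetical: read a mock payload locally instead of the POST body.
  const payload = isRunningLocally
    ? require('./mocks/incoming_call.json')
    : event.body;
  // ... handle the call ...
  return { statusCode: 200 };
};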

How do you use s3a with spark 2.1.0 on aws us-east-2?

Background
I have been working on getting a flexible setup for myself to use spark on aws with docker swarm mode. The docker image I have been using is configured to use the latest spark, which at the time is 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook.
This is working, and I have been going through to test out the various connectivity paths that I plan to use. The issue I came across is uncertainty about the correct way to interact with S3. I have followed the trail on how to provide the dependencies for Spark to connect to data on AWS S3 using the s3a protocol vs. the s3n protocol.
I finally came across the hadoop-aws guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as seen in this question that describes how to fix it by defining the endpoint, which I had already done.
Being on us-east-2 put me far enough off the standard configuration that I wasn't sure whether I had a problem with the jar files. To eliminate the region issue, I set things back up in the regular us-east-1 region and was finally able to connect with s3a. So I have narrowed the problem down to the region, but I thought I was doing everything required to operate in the other region.
Question
What is the correct way to use the configuration variables for hadoop in spark to use us-east-2?
Note: This example uses local execution mode to simplify things.
import os
import pyspark
I can see in the notebook console that these packages download after the context is created; adding them took me from being completely broken to getting the Bad Request error.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
conf = pyspark.SparkConf().setMaster('local[1]')
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)
For the AWS config, I tried both the method below and the equivalent conf.set("spark.hadoop.fs.<config_string>", <config_value>) pattern on the conf above; the only difference with the latter is that I set the values on conf before creating the Spark context.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
One thing to note is that I also tried the alternative endpoint for us-east-2, s3-us-east-2.amazonaws.com.
I then read some parquet data off of s3.
df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()
Again, after moving the EC2 instance to us-east-1 and commenting out the endpoint config, the above works for me. To me, it seems like the endpoint config isn't being used for some reason.
us-east-2 is a V4-auth S3 region, so, as you attempted, the fs.s3a.endpoint value must be set.
If it's not being picked up, then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes the config, even when it's lacking auth details.
Some tactics
set the value in spark-defaults
using the config you've just created, try to explicitly load the filesystem: a call to FileSystem.get(new URI("s3a://bucket-name/parquet-data-name"), myConf) will return the bucket with that config (unless it's already cached). I don't know how to make that call in .py though.
set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get command (see the sketch below)
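As a hedged illustration of the first and third tactics (the keys are the standard hadoop-aws property names, passed through Spark's spark.hadoop.* prefix; access_id and access_key are the question's placeholders):
import pyspark

# Sketch: put the s3a settings on the SparkConf before the context exists so they
# reach the Hadoop configuration before any filesystem instance is created, and
# disable the s3a filesystem cache so a stale instance cannot be reused.
conf = (pyspark.SparkConf()
        .setMaster('local[1]')
        .set('spark.hadoop.fs.s3a.endpoint', 's3.us-east-2.amazonaws.com')
        .set('spark.hadoop.fs.s3a.access.key', access_id)
        .set('spark.hadoop.fs.s3a.secret.key', access_key)
        .set('spark.hadoop.fs.s3a.impl.disable.cache', 'true'))
sc = pyspark.SparkContext(conf=conf)
For the second tactic, the same Java call can be reached from Python through py4j (an untested sketch):
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
    sc._jvm.java.net.URI('s3a://bucket-name'), sc._jsc.hadoopConfiguration())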
Adding more diagnostics on BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I can review it and get it in.

How to connect to Cassandra using ruby

I am new to Cassandra and was trying to achieve some simple operations like inserting data into Cassandra. I am using the cassandra gem to achieve this.
client = Cassandra.new('tags_logs', 'ec2-xxx-xxx-xxx.com:9160')
client.disable_node_auto_discovery!
client.get('tag_data','red')
And I get the following error:
ThriftClient::NoServersAvailable - No live servers in ...
I'm running this code from my local machine. While I have no problem connecting using cassandra-cli (so it is not a firewall issue), the code refuses to work. It works perfectly when accessing Cassandra on my own local machine.
Any ideas?
Thanks,
Eden.
I recommend using this gem I'm developing: https://github.com/hsgubert/cassandra_migrations
It gives access to Cassandra through CQL3 and manages schema with migrations.
Note: it requires Rails.
For future generations: simply change the timeout ...
client = Cassandra.new('tags_logs', 'ec2-example-example-example.com:9160',:connect_timeout => 10000)
