Python: Distributed computing on cloud for a python function - parallel-processing

I have a Python function that is as simple as shown below. However, processing the IDs one by one will take far too long.
So I'm considering splitting the input list of IDs into multiple lists in order to run the function in parallel on the cloud. I believe AWS EC2 is one option, but configuring instances one by one is too complex. Is there any simple way to speed up my work?
import many_packages  # stands in for the actual imports
from typing import List

def myfunc(list_of_ids: List[int]) -> None:
    new_file = run_preprocessing_with_pandas(list_of_ids)
    # this stage needs a Java runtime and Python's subprocess
    results = run_a_jar_with(new_file)
    upload_results_to_s3(results)
Expected result:
many_lists = split_into_chunks(list_of_ids, chunk_size=100)
for i, c in enumerate(computers):  # computers: some pool of cloud instances
    c.compute(myfunc, many_lists[i])  # run myfunc(many_lists[i]) on instance c
The current constraint is that I cannot use pyspark, because the data must be processed with a library that only supports pandas. So I'm researching another framework, Dask, to see how feasible this would be with it.
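For illustration, here is a minimal sketch of how I imagine this could look with Dask's distributed scheduler (the scheduler address, the split_into_chunks helper, and the reuse of myfunc and list_of_ids from above are assumptions, not a tested setup):

from typing import List
from dask.distributed import Client

def split_into_chunks(ids: List[int], chunk_size: int = 100) -> List[List[int]]:
    # plain Python chunking of the id list
    return [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

# point the client at an existing Dask cluster (for example one provisioned
# with dask-cloudprovider on EC2); Client() with no address runs locally
client = Client("tcp://scheduler-address:8786")

# list_of_ids and myfunc come from the function above
chunks = split_into_chunks(list_of_ids, chunk_size=100)
futures = client.map(myfunc, chunks)   # one task per chunk, spread across workers
client.gather(futures)                 # block until every chunk has finished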

How can I auto generate inputs.tf and outputs.tf variables when working with Terraform?

Note: Please see the #### UPDATE ### section below. I've heavily modified the question for clarity on what I'm trying to achieve, but added it as an addendum rather than rewrite the question.
As my infrastructure grows, adding input variables to my variables.tf files and then syncing those values to output variables in my outputs.tf file has become impossible to do manually. Not only is it taking up a lot of unnecessary time; probably more time is spent going back and fixing the ones that terraform validate tells me I missed through human error. This is especially true when building or using modules, whose arguments add an additional layer to manage.
There has to be a better way. Here is what I want to achieve.
Let's say I'm creating an Azure AKS Kubernetes cluster. The Terraform resource is azurerm_kubernetes_cluster.
Only 8 arguments are required to create a base install, but there are almost 250 additional ones, and they all have default values. Per the documentation page, they also already have fantastic descriptions. (I'm tired of copying and pasting them into my variable { description = "this" } blocks.)
The information is there in the documentation. terraform plan also has knowledge of every single additional one, because it of course comes up in the pre-apply plan. (known after apply) means it's optional but will have a default value.
In my dream world, I'd run this hypothetical command sequence:
terraform plan
terraform document <- Here it auto-generates every argument as a variable block and inserts it into variables.tf. It also auto-generates every possible output "out_putable" {} block and inserts it into outputs.tf.
terraform apply -update-inputs -update-outputs <- Here everything that was optional (known after apply) is now known, and it should auto-update variables.tf and outputs.tf accordingly. Adding a -update-modules flag lets it take care of that additional layer introduced by using modules.
This feels like a problem that has been addressed before. Before I write a custom tool that parses the Terraform web docs and the output of terraform show, is there already a way to do this? Terraform-docs is the closest I've come to finding a solution, but that's for README.md. If it can do what I need, I haven't figured it out yet.
How can I automate all this?
############
UPDATE
############
This article and video are spot-on when it comes to Terraform's evolution in an organization. My organization is somewhere between late-stage pattern 3 and early pattern 5. As we decompose our "Terralith", we have inconsistencies among teams (patterns, naming conventions, variable and argument choices, etc.). These are starting to cause errors in CI/CD, forcing a ticket-review process that is slowing things down.
All resources have required and optional arguments. But in my organization, for example, some arguments that are optional upstream are required for us.
Scenario: Dev A in Japan creates a resource, forgets an optional variable or two or names them something obscure, etc. Dev B in America is blocked until they can convene and discuss. Given time zones, language differences, ticket review, this one issue is now a week or more delayed.
I need to automate this and create exact consistency so that Dev A starts out with exactly what Dev B would start with or is expecting; and, what CI/CD tests are expecting - templating the initial process, if you will. In other words, I need to remove the human element of manually creating main.tf, variables.tf, outputs.tf, etc.
Here are thoughts on how to achieve this:
Use Golang to autogenerate the files by querying the API
How can I query the API to get a list of all required arguments for a specific resource?
I found that I can query for provider information, but I can't find a way to retrieve resource information. My thinking is that when a developer wants to create a new resource, they'll run a Go or TypeScript tool to generate the manifest files with the expected naming conventions and populate main.tf, variables.tf, outputs.tf, etc., with exactly the data that everyone is expecting. I'm looking for something like curl registry.terraform.io/providers/hashicorp/azurerm/v2.99/resource_group?required=yes This should show me all required arguments along with descriptions and other info I can use straight from the API.
Use CDKTF to generate an HCL manifest.tf file from JSON
How can I use CDKTF to generate an HCL .tf file?
CDKTF is EXACTLY what I'm looking for - except in reverse. HCL is seamlessly compatible with JSON. Running cdktf synth creates ./out/cdk.tf.out. I'm so close! How do I turn that file into main.tf?!?
The goal here is to have a master file from which all future manifest files are derived. Whether we use azurerm_kubernetes_cluster 1 time or 1000 times, I know for certain that every argument, every variable name, and every desired output is exactly the same. If a change is needed in our desired structure, it will be updated at the JSON level, and CI/CD can ensure those changes are propagated across every instance of its use.
I know that I can use the cdk.tf.out file as a drop-in replacement for a module, but I don't want my team members to have to learn TypeScript or how to read JSON. If I can create a templatized JSON file containing exactly what I'm expecting users to start with, and if they can run some command like cdktf convert cdk.tf.out --HCL output-file.tf, then I've accomplished my goal.
If cdktf synth can create an HCL JSON file, and cdktf convert can take a manifest.tf file and turn it into HCL JSON, can't it do the exact opposite? Turn the HCL JSON file into the human-readable, declarative, manifest.tf file?
Perhaps think of it this way. Terraform has a required file structure for a module if it's to be allowed into the module registry. I'm trying to create a similar required structure for each of the resources our organization uses regardless of when and where it's used.
If your goal is to derive input variables and output values from resource type schemas then Terraform can provide you with the information to do so.
In the working directory of a configuration that already uses the provider whose resource type you want to use, run the following command:
terraform providers schema -json
The result contains a JSON description of all of the resource types available in the providers for the current configuration, and for each one the metadata about its attributes, including the type constraint information and descriptions for each one.
From that you can generate whatever other files you need based on that information.
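For example, here is a minimal Python sketch that turns that JSON into variable blocks. It assumes the documented layout of the schema output (provider_schemas -> resource_schemas -> block -> attributes), ignores nested blocks and type constraints, and uses the azurerm provider and azurerm_kubernetes_cluster purely as examples:

import json
import subprocess

PROVIDER = "registry.terraform.io/hashicorp/azurerm"   # example provider address
RESOURCE = "azurerm_kubernetes_cluster"                 # example resource type

# run in a working directory that already uses this provider
raw = subprocess.run(
    ["terraform", "providers", "schema", "-json"],
    capture_output=True, text=True, check=True,
).stdout
schema = json.loads(raw)

attrs = (schema["provider_schemas"][PROVIDER]
               ["resource_schemas"][RESOURCE]["block"]["attributes"])

with open("variables.tf", "w") as f:
    for name, attr in sorted(attrs.items()):
        if attr.get("computed") and not attr.get("optional"):
            continue  # purely computed attributes are outputs, not inputs
        description = (attr.get("description") or "").replace('"', "'")
        f.write(f'variable "{name}" {{\n')
        f.write(f'  description = "{description}"\n')
        f.write("}\n\n")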
Note that if you are intending to build modules which export the entire surface area (all inputs and all outputs) of a particular resource type, the Terraform documentation explicitly recommends against this, suggesting that you just use the resource type directly instead, since such a module would often not offer sufficient benefit to outweigh the additional complexity and maintenance overhead it implies:
In principle any combination of resources and other constructs can be factored out into a module, but over-using modules can make your overall Terraform configuration harder to understand and maintain, so we recommend moderation.
A good module should raise the level of abstraction by describing a new concept in your architecture that is constructed from resource types offered by providers.
For example, aws_instance and aws_elb are both resource types belonging to the AWS provider. You might use a module to represent the higher-level concept "HashiCorp Consul cluster running in AWS" which happens to be constructed from these and other AWS provider resources.
We do not recommend writing modules that are just thin wrappers around single other resource types. If you have trouble finding a name for your module that isn't the same as the main resource type inside it, that may be a sign that your module is not creating any new abstraction and so the module is adding unnecessary complexity. Just use the resource type directly in the calling module instead.
I had the same question and developed a small bash script to create output definitions based on module code.
This script requires the hcledit tool to extract blocks from HCL code.
#!/usr/bin/env bash
set -o pipefail
_hcledit=$(which hcledit)
for tf_file in *.tf; do
  cat "$tf_file" | $_hcledit block list | while read -r line; do
    block_type="${line%%.*}"   # e.g. "resource" in "resource.aws_s3_bucket.logs"
    line="${line#*.}"          # drop the block type, keep the rest of the address
    case $block_type in
      # blocks that should not produce an output definition
      locals|output|variable|data) continue ;;
      module)
        output_name=$line
        output_description="Module '$output_name' attributes"
        output_value="$block_type.$output_name"
        ;;
      resource)
        label_kind="${line%.*}"
        label_name="${line#*.}"
        output_name="${label_kind}_${label_name//[\-]/_}"   # replace dashes with underscores
        output_description="Resource '$label_kind.$label_name' attributes"
        output_value="$label_kind.$label_name"
        ;;
    esac
    cat <<-EOT
output "$output_name" {
  description = "$output_description"
  value       = $output_value
}
EOT
  done
done
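A usage sketch, assuming the script above is saved as generate_outputs.sh and made executable: run it from the module's directory and redirect stdout, e.g. ./generate_outputs.sh > outputs.tf. Existing output blocks are skipped by the case statement, so re-running it is safe.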

How to read a SQL table in NiFi?

I am trying to create a basic flow in NiFi:
read a table from SQL
process it in Python
write the result back to another table in SQL
It is as simple as that.
But I am facing issues when I try to read the data in Python.
As far as I have learned, I need to use sys.stdin/stdout.
At the moment it only reads and writes, as shown below.
import sys
import pandas as pd
file = pd.read_csv(sys.stdin)
file.to_csv(sys.stdout,index=False)
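For what it's worth, a minimal sketch of what I mean by processing in Python, keeping the stdin/stdout contract that ExecuteStreamCommand uses (the added column is only a placeholder for the real logic):

import sys
import pandas as pd

df = pd.read_csv(sys.stdin)

# placeholder transformation step; replace with the real processing
df["processed"] = True

df.to_csv(sys.stdout, index=False)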
Below are the processor properties for QueryDatabaseTableRecord, ExecuteStreamCommand, and PutDatabaseRecord, plus the error message, but I don't think they are the issue. (Screenshots omitted.)
There's a much easier way to do this if you're running 1.12.0 or newer: ScriptedTransformRecord. It's like ExecuteScript except it works on a per-record basis. This is what a simple Groovy script for it looks like:
def fullName = record.getValue("FullName")
def nameParts = fullName.split(/[\s]{1,}/)
record.setValue("FirstName", nameParts[0])
record.setValue("LastName:", nameParts[1])
record
It's a new processor, so there's not much documentation on it yet aside from the (very good) documentation bundled with it, and samples might be sparse at the moment. If you want to use it and run into issues, feel free to join the nifi-users mailing list and ask for more detailed help.

Nvidia Digits accuracy and loss plots data

I trained my model in Nvidia DIGITS 5 and I would now like to extract the accuracy and loss plots that were generated during training for a report. Is this data saved somewhere, so that I could extract it, plot it in Python, and perhaps ultimately modify the plots to compare different models, etc.?
The best solution I have found is to either look at the HTML file or to scan the text file caffe_output.log that is produced by Caffe. The text file is usually stored in /var/digits/jobs/insert_your_job_id/, but on Linux systems you can also just run:
locate caffe_output.log
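If you go the log-scanning route, here is a minimal parsing sketch. It assumes the standard Caffe log line formats ("Iteration N, loss = X" and "Test net output #k: accuracy = Y"); adjust the regular expressions if your log differs:

import re

iter_loss = []   # (iteration, training loss) pairs
accuracy = []    # test accuracy values, in the order they appear

with open('caffe_output.log') as log:
    for line in log:
        m = re.search(r'Iteration (\d+).*loss = ([\d.eE+-]+)', line)
        if m:
            iter_loss.append((int(m.group(1)), float(m.group(2))))
        m = re.search(r'Test net output #\d+: accuracy = ([\d.eE+-]+)', line)
        if m:
            accuracy.append(float(m.group(1)))

print(len(iter_loss), 'loss points,', len(accuracy), 'accuracy points')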
Go to your DIGITS job folder and locate your job's subfolder. Inside you'll find a file status.pickle, which is a pickled object containing all your job's information.
You can load it in python like so:
import digits
import pickle
data = pickle.load(open('status.pickle','rb'))
This object is somewhat generic and may contain multiple tasks. For a typical classification task it will likely be just one, but you will still need to access it via data.tasks[0]. From there you can grab the plots:
data.tasks[0].combined_graph_data()
which returns a somewhat convoluted dict (unfortunately - since your network can produce many accuracy/loss outputs, as well as even custom ones). It contains everything you need though - I managed to plot accuracy with:
plt.plot( data.tasks[0].combined_graph_data()['columns'][2][1:] )
but it's likely that you'll have to write a bit of custom code. As always, dir() is your friend.
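Building on that, here is a minimal plotting sketch. The column layout of combined_graph_data() varies by network, so treat the label/value split below as an assumption and inspect the dict for your own job:

import pickle
import digits   # needed so pickle can resolve the DIGITS classes
import matplotlib.pyplot as plt

with open('status.pickle', 'rb') as f:
    data = pickle.load(f)

graph = data.tasks[0].combined_graph_data()
for column in graph['columns']:
    label, values = column[0], column[1:]   # first element appears to be the series name
    plt.plot(values, label=str(label))

plt.legend()
plt.savefig('training_curves.png')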

Can I tell SPSS to run certain syntax lines using a syntax command?

So I was wondering if it was possible to write something up in the syntax which tells the program to run certain command lines. I'm not very good at explaining, so here's an example:
*Total sample frequency.
FREQUENCIES VARIABLES=Age Gender CigDay CO Min_last Day_abs Cigs_Monthly
/ORDER=ANALYSIS.
*6. Next, using the split-file function, perform the frequency analysis for each gender.
* Split file.
SORT CASES BY Gender.
SPLIT FILE LAYERED BY Gender.
*7 Run frequency again.
FREQUENCIES VARIABLES=Age Gender CigDay CO Min_last Day_abs Cigs_Monthly
/ORDER=ANALYSIS.
So, I was wondering whether it is possible not to have to copy/paste the FREQUENCIES command, and simply include a line of syntax that tells SPSS to re-run syntax rows 37 to 38 (which is where the first FREQUENCIES command is written).
The short answer is: no. There is no command available that would allow you to run a specific line of syntax. Certainly you can do it manually, by selecting and running the lines you need.
But there are other options available for such tasks, when you need to re-run a part of the code several times:
INSERT command. Save the code you need to run several times in an external syntax file and insert it where needed in your main syntax file.
DEFINE and END DEFINE commands. Define the code you need to run several times as a macro and call it where needed in your main syntax file.
I suggest not using INCLUDE as it is obsolete, although it is still supported. INSERT provides better functionality.
If you set out to build a macro library for your frequently used commands, think about parameterizing them so that, for example, you can pass in the specific variables to use as arguments. See the Command Syntax Reference entry for DEFINE via the Help menu for full details, but be prepared to spend some time studying it.

With CRF++, MIRA works for me but CRF-L1 and CRF-L2 do not

It may not matter, but I am using the windows distribution of CRF++ 0.58.
So I have successfully used mallet to train a model with a CRF and then test it. When I try to use the same train and test files with CRF++ (and after creating a template file), I get a
The line search routine mcsrch failed: error code:0
error when I use either
-a CRF-L1
or the default
-a CRF-L2
When I use
-a MIRA
though, training works without error and same with test.
The format of the test and training data can be the same for both mallet and crf++, so that is not the issue. My template file is as simple as
#Mixed
M00:%x[0,0]
M01:%x[0,1]
M02:%x[0,2]
......
M12:%x[0,12]
My last column in the training data is either 0 or 1, which is the value to classify with. There is no whitespace in any of my features; I use underscores where necessary. Am I missing something simple here? What would cause the L1 and L2 regularizations to fail like that?
I knew it was something silly ...
To use features like I am using, you need to use the U prefix (as in Unigram). So U00:%x[0,0] is fine. You can't just name your features anything you want.
I also discovered that if I stripped down my test data to a single sentence, I would get the same error message. When I restored my test data back to its original size of around 2600 sentences, the regularization algorithms now run. Overfitting is a common cause of this error message across various nlp and ml applications, but that was not the true problem in my case.
It can also happen in the extreme case of a dataset with just one CLASS (due to a bug in the training-set generation procedure).
