I am trying to create a basic flow in NiFi:
read a table from SQL
process it in Python
write it back to another SQL table
It is as simple as that.
But I am facing issues when I try to read the data in Python.
As far as I have learned, I need to use sys.stdin/stdout.
At the moment the script only reads and writes, as below.
import sys
import pandas as pd
# ExecuteStreamCommand pipes the FlowFile content into stdin
df = pd.read_csv(sys.stdin)
# whatever is written to stdout becomes the outgoing FlowFile content
df.to_csv(sys.stdout, index=False)
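The actual processing would go between the read and the write; as a rough placeholder (the column names here are invented, not from my real table), something like:
import sys
import pandas as pd
df = pd.read_csv(sys.stdin)                      # FlowFile content arrives on stdin
df["amount_doubled"] = df["amount"] * 2          # placeholder transformation
df.to_csv(sys.stdout, index=False)               # stdout becomes the new FlowFile content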
Below you can find the processor properties, but I don't think they are the issue.
QueryDatabaseTableRecord:
ExecuteStreamCommand:
PutDatabaseRecord:
Error Message:
There's a much easier way to do this if you're running 1.12.0 or newer: ScriptedTransformRecord. It's like ExecuteScript except it works on a per-record basis. This is what a simple Groovy script for it looks like:
// split the FullName field on whitespace and write the two parts back
def fullName = record.getValue("FullName")
def nameParts = fullName.split(/[\s]{1,}/)
record.setValue("FirstName", nameParts[0])
record.setValue("LastName", nameParts[1])
record
It's a new processor, so there's not much documentation on it yet aside from the (very good) documentation bundled with it, so samples might be sparse at the moment. If you want to use it and run into issues, feel free to join the nifi-users mailing list and ask for more detailed help.
Related
I have a Python function that is as simple as shown below. However, processing the ids one by one will take far too long.
So I'm considering splitting the input list of ids into multiple lists in order to run the function in the cloud in parallel. I believe AWS EC2 is one option, but configuring instances one by one is too complex. I'm wondering if there's any simple way to speed up my work?
from typing import List

import many_packages  # placeholder for the real dependencies

def myfunc(list_of_ids: List[int]) -> None:
    new_file = run_preprocessing_with_pandas(list_of_ids)
    # this stage needs a Java runtime and Python's subprocess
    results = run_a_jar_with(new_file)
    upload_results_to_s3(results)
Expected result:
many_lists = split_into_chunks(list_of_ids, chunk_size=100)
for i, c in enumerate(computers):  # computers: a pool of cloud instances
    c.computing(myfunc(many_lists[i]))
The current constraint is that I cannot use PySpark, because the data must be processed with a library that only supports pandas. So I'm researching another framework, Dask, to see how feasibly this can be done with it.
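For what it's worth, a minimal sketch of the chunk-and-distribute pattern with dask.distributed (the scheduler address is an assumption, and myfunc/list_of_ids are the hypothetical names from above) could look like this:
from typing import List
from dask.distributed import Client

def chunk(ids: List[int], size: int = 100) -> List[List[int]]:
    # split the id list into fixed-size chunks
    return [ids[i:i + size] for i in range(0, len(ids), size)]

client = Client("tcp://scheduler-address:8786")  # address of an existing Dask cluster (assumption)

# submit one task per chunk; each worker runs myfunc on its own chunk
futures = [client.submit(myfunc, part) for part in chunk(list_of_ids)]

# block until every chunk has been processed and uploaded
client.gather(futures)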
As the title suggests, I am just trying to do a simple export of a DataStage job. The issue occurs when we export the XML and begin examining it. For some reason, the wrong information is being pulled from the job and placed in the XML.
As an example the SQL in a transform of the job may be:
SELECT V1,V2,V3 FROM TABLE_1;
Whereas the XML for the same transform may produce:
SELECT V1,Y6,Y9 FROM TABLE_1,TABLE_2;
It makes no sense to me how the export of a job could be different from the job's actual architecture.
The parameters I am using to export are:
Exclude Read Only Items: No
Include Dependent Items: Yes
Include Source Code with Routines: Yes
Include Source Code with Job Executable: Yes
Include Source Content with Data Quality Specifications: No
What tool are you using to view the XML? Try using something less smart, such as Notepad or WordPad. This will help you determine whether the problem is with your XML viewer.
You might also try exporting in DSX format and examining that output, to see whether the same symptoms are visible there.
Thank you all for the feedback. I realized that the issue wasn't necessarily with the XML. It had to do with numerous factors within our DataStage environment. As mentioned above, the data connections were old and unreliable. For some reason this does not impact our current production refresh, so it's a non-issue.
The other issue was the way that the generated-SQL and custom-SQL options work when creating the XML. In my case, there were times when old code was kept in the system, but the option was switched from custom code to generating SQL based on columns. This led to inconsistent output from my script, so the mini project was scrapped.
I used the sentence
He died in the day before yesterday.
as input to CoreNLP NER.
On the server, I got a result like this.
Locally, I used the same sentence and got this result:
He(O) died(O) in(O) the(O) day(TIME) before(O) yesterday(O) .(O)
So, how can I get the same result as the server?
In order to increase the likelihood of getting a relevant answer, you may want to rephrase your question and provide a bit more information. And as a bonus, in the process of doing so, you may even find out the answer yourself ;)
For example, what URL are you using to get your server result? When I check here: http://nlp.stanford.edu:8080/ner/process , I can select multiple models for English. I'm not sure which version their API is based on (I would guess the most recent stable version, but I don't know). The title of your post suggests you are using 3.8 locally, but it wouldn't hurt to specify the relevant piece of your pom.xml file, or the models you downloaded yourself.
What model are you using in your code? How are you calling it? (i.e. any other annotators in your pipeline that could be relevant for NER output)
Are you even calling it from code (if so, Java? Python?), or using it from the command line?
A lot of this is summarised in https://stackoverflow.com/help/how-to-ask and it's not that long to read through ;)
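For example, if you are calling a CoreNLP server over HTTP from Python, spelling out the annotators and output format makes the local and remote setups easy to compare; a rough sketch, assuming a server on localhost:9000 (the URL and property values are assumptions):
import json
import requests

text = "He died in the day before yesterday."

# properties should mirror whatever the local pipeline uses
props = {"annotators": "tokenize,ssplit,pos,lemma,ner", "outputFormat": "json"}

# assumes a CoreNLP server is running locally on port 9000
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data=text.encode("utf-8"),
)

for sentence in resp.json()["sentences"]:
    print([(tok["word"], tok["ner"]) for tok in sentence["tokens"]])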
I am importing data with the new Neo4j version (2.1.1) that allows for CSV import. The CSV import in question deals with bigrams.
The CSV file looks like this:
$ head ~/filepath/w2.csv
value,w1,w2,
275,a,a
31,a,aaa
29,a,all
45,a,an
I am pasting this into the neo4j-shell client to load the CSV:
neo4j-sh (?)$ USING PERIODIC COMMIT
> LOAD CSV WITH HEADERS FROM "file:/Users/code/Downloads/w2.csv" AS line
> MERGE (w1:Word {value: line.w1})
> MERGE (w2:Word {value: line.w2})
> MERGE (w1)-[:LINK {value: line.value}]->(w2);
The problem is that the shell now hangs and I have no idea what it is doing. I have checked with the interactive browser environment and it does not seem to have loaded any data. It seems unlikely that I have not hit a periodic-commit point yet, as the shell has been working for half an hour now.
Is there any way for me to get a sign of life from the CSV loader? I would like to see some intermediate results to assist me in debugging what is going on. A solution to my current situation is welcome, but I am specifically interested in a way to debug the CSV loader.
I don't have an answer for the debugging/logging part.
But here is a hint that might make your query a bit faster.
Do you have an index on :Word(value)?
You can try adding one:
CREATE INDEX ON :Word(value);
UPDATE:
If you want to follow the progress of the import, you can watch the disk size of the graph.db directory. That gives you a rough idea of how far along it is.
On a unix machine:
du -s ~/neo4j-community-2.1.2/data/graph.db/
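Another rough sign of life is to poll the node count from a second session while the import runs; a sketch against Neo4j 2.x's REST transactional endpoint (the URL and the 30-second interval are assumptions):
import time
import requests

# Neo4j 2.x transactional Cypher endpoint on a default local install
URL = "http://localhost:7474/db/data/transaction/commit"
QUERY = {"statements": [{"statement": "MATCH (w:Word) RETURN count(w)"}]}

while True:
    resp = requests.post(URL, json=QUERY)
    count = resp.json()["results"][0]["data"][0]["row"][0]
    print("Word nodes so far:", count)
    time.sleep(30)  # poll every 30 seconds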
I have the OID: .1.3.6.1.2.1.25.3.3.1.2
I got 24 rows (I have a 24-core server),
I want to create one graph with all the rows to see the utilization.
Please help me :)
Thanks...
Had the same problem, and I created a data input method in Perl which uses Net::SNMP.
Get the script here:
https://gist.github.com/1139477
Get the data template here:
https://gist.github.com/1237260
Put the script into $CACTI_HOME/scripts, make sure it's executable and import the template.
Make sure you have Perl's Net::SNMP installed.
Have fun!
Alex.
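If Perl is not an option, the same idea can be roughly sketched in Python by shelling out to net-snmp's snmpwalk and printing the per-core hrProcessorLoad values in Cacti's script-output format (the host, community string, and field names are assumptions):
import subprocess

HOST = "192.0.2.1"                    # the monitored server (assumption)
COMMUNITY = "public"                  # SNMP v2c community string (assumption)
OID = ".1.3.6.1.2.1.25.3.3.1.2"       # hrProcessorLoad, one row per core

# -Oqv prints just the values, one per line
out = subprocess.check_output(
    ["snmpwalk", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, OID],
    text=True,
)

loads = [int(v) for v in out.split()]

# Cacti script data input methods expect "name:value" pairs on a single line
print(" ".join("cpu{}:{}".format(i, v) for i, v in enumerate(loads)))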