Apache NiFi: Pipe files using GetFile into ExecuteProcess

I have a Python script that takes command line arguments to decrypt a file. The command to be executed looks like this:
python decrypt.py -f "file_to_decrypt.enc" -k "private_key_file.txt"
I am picking those files up using the GetFile processor in NiFi, and it does its job, since I can see the filenames in the logs.
On the other hand, I have an ExecuteProcess processor set up to run the Python script as shown above. However, I need the filenames to be passed into ExecuteProcess for the Python script to work. So my question is: how do I pipe the files from the GetFile processor into the ExecuteProcess processor in Apache NiFi?

You can use the ExecuteStreamCommand processor instead of ExecuteProcess. This processor accepts an incoming flowfile and can access attributes and content, whereas ExecuteProcess is a source processor and doesn't accept incoming flowfiles.
I don't know if you need GetFile (which reads the files' content); try ListFile and RouteOnAttribute to filter down to the two filenames you want. Merge the two successful listings into one flowfile with MergeContent, then use the ${filename} attributes and expression language to populate the command arguments with x.enc and y.txt.
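As a rough illustration only (enc.filename and key.filename are hypothetical attributes you would set yourself, e.g. with UpdateAttribute, before or during the merge), the ExecuteStreamCommand configuration could look something like:
Command Path: python
Command Arguments: /path/to/decrypt.py;-f;${enc.filename};-k;${key.filename}
using the processor's default ; argument delimiter.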
Update
I built a template that performs the following tasks:
Generates the example key file (not a valid key)
Generates the example encrypted data file (not valid cipher text)
Uses ListFile, UpdateAttribute, RouteOnAttribute, MergeContent, and ExecuteStreamCommand to perform the command-line Python decryption (mocked by echo)
Note: this uses the expression language function ifElse(), which is currently in NiFi master but not yet released. It will be part of the 1.2.0 release, but if you build from master, you can use it now.
I still think EncryptContent or especially ExecuteScript is more compact, but this works.
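For reference, this is not the template above, just the general shape a Jython ExecuteScript body could take; the decryption itself is left as a placeholder and key.filename is a hypothetical attribute set earlier in the flow:

from org.apache.nifi.processor.io import StreamCallback
from org.apache.commons.io import IOUtils

class Decrypt(StreamCallback):
    def __init__(self, key_path):
        self.key_path = key_path

    def process(self, inputStream, outputStream):
        encrypted = IOUtils.toByteArray(inputStream)
        # real decryption using self.key_path would go here;
        # this sketch just passes the bytes through unchanged
        outputStream.write(encrypted)

flowFile = session.get()
if flowFile is not None:
    key_path = flowFile.getAttribute('key.filename')  # hypothetical attribute
    flowFile = session.write(flowFile, Decrypt(key_path))
    session.transfer(flowFile, REL_SUCCESS)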

Related

How to use airflow for real time data processing

I have a scenario where I want to process a CSV file and load it into some other database:
Cases
pick up the CSV file and load it into MySQL with the same name as the CSV
then do some modification on the loaded rows using a Python task file
after that, extract the data from MySQL and load it into some other database
The CSV files come from a remote server into a folder on one Airflow server.
We have to pick up these CSV files and process them through a Python script.
Suppose I pick one CSV file; then I need to pass this CSV file to the rest of the operators in a dependency manner, like
filename : abc.csv
task1 >> task2 >> task3 >> task4
So abc.csv should be available to all the tasks.
Please tell me how to proceed.
Your scenario doesn't have anything to do with real time. This is ingesting on a schedule/interval. Or perhaps you could use a Sensor operator to detect data availability.
Implement each of your requirements as functions and call them from operator instances.
Add the operators to a DAG with a schedule appropriate for your incoming feed.
The ways to pass and access params are:
- kwargs passed to the python_callable when initializing an operator (e.g. op_kwargs on PythonOperator)
- context['param_key'] in the execute method when extending an operator
- Jinja templates
A rough sketch of the first approach is shown below.
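This is only a sketch; the function name, file name, and task id are placeholders, and the dag object is assumed to be defined elsewhere in the DAG file:

from airflow.operators.python_operator import PythonOperator

def process_file(filename, **context):
    # filename arrives via op_kwargs; the execution date comes from the context
    print("processing %s for %s" % (filename, context['ds']))

task1 = PythonOperator(
    task_id='task1',
    python_callable=process_file,
    op_kwargs={'filename': 'abc.csv'},
    provide_context=True,
    dag=dag,
)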
relevant...
airflow pass parameter from cli
execution_date in airflow: need to access as a variable
The way tasks communicate in Airflow is via XCom, but it is meant for small values, not for file content.
If you want your tasks to work with the same CSV file, you should save it in some location and then pass the path to that location via XCom.
We are using the LocalExecutor, so the local file system is fine for us.
We decided to create a folder for each dag with the name of the dag. Inside that folder we generate a folder for each execution date (we do this in the first task, that we always call start_task). Then we pass the path of this folder to the subsequent tasks via Xcom.
Example code for the start_task:
import os
from airflow.operators.python_operator import PythonOperator

def start(share_path, **context):
    # DATE_FORMAT, _create_folder_delete_if_exists and dag are defined elsewhere in the DAG file
    execution_date_as_string = context['execution_date'].strftime(DATE_FORMAT)
    execution_folder_path = os.path.join(share_path, 'my_dag_name', execution_date_as_string)
    _create_folder_delete_if_exists(execution_folder_path)
    task_instance = context['task_instance']
    task_instance.xcom_push(key="execution_folder_path", value=execution_folder_path)

start_task = PythonOperator(
    task_id='start_task',
    provide_context=True,
    python_callable=start,
    op_args=[share_path],
    dag=dag
)
The share_path is the base directory for all DAGs; we keep it in the Airflow Variables.
Subsequent tasks can get the execution folder with:
execution_folder_path = task_instance.xcom_pull(task_ids='start_task', key='execution_folder_path')
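A subsequent task could then be wired up along these lines (continuing the snippet above; the task id, function name, and abc.csv are placeholders):

def process(**context):
    task_instance = context['task_instance']
    execution_folder_path = task_instance.xcom_pull(task_ids='start_task', key='execution_folder_path')
    csv_path = os.path.join(execution_folder_path, 'abc.csv')  # hypothetical file inside the execution folder
    # ... load csv_path into MySQL, transform, export, etc. ...

process_task = PythonOperator(
    task_id='process_task',
    provide_context=True,
    python_callable=process,
    dag=dag
)
start_task >> process_task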

Apache NiFi decompression

I'm new to Apache NiFi and trying to build a flow as a POC. I need your guidance on the following.
I have a compressed 'gz' file, say 'sample.gz', containing a file, say 'sample_file'.
I need to decompress the sample.gz file and store 'sample_file' in an HDFS location.
I'm using the GetFile processor to get the sample.gz file, the CompressContent processor in decompress mode to decompress the file, and the PutHDFS processor to put the decompressed file in the HDFS location.
After running the flow, I find that the original sample.gz file is simply copied to the HDFS location, whereas I needed the sample_file inside the gz file to be copied. So decompression has not actually worked for me.
I hope I have explained the issue I'm facing. Please suggest if I need to change my approach.
I used the same sequence of processors but changed PutHDFS to PutFile.
GetFile --> CompressContent(decompress) --> PutFile
In NiFi v1.3.0 it works fine.
The only note: if I keep the parameter Update Filename = false for CompressContent, then the filename attribute remains the same after decompression as before (sample.gz).
But the content is decompressed.
So, if your question is about the filename, then either:
you can change it by setting the parameter Update Filename = true in the CompressContent processor; in this case sample.gz will be changed to sample during decompression, or
use an UpdateAttribute processor to change the filename attribute.
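For the second option, a minimal example (assuming you simply want to strip the .gz extension) is an UpdateAttribute processor with a property named filename set to:
${filename:substringBeforeLast('.')}
which turns sample.gz into sample.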

Writing to popen and reading back several files in Ruby

I need to run some shell commands on a number of files and sometimes I get back more than one file in response. The question is: How can I read back several files from IO.popen in Ruby?
For instance, imagine the following case:
file = grid.get(record['_id']) # fetch a file from database
IO.popen('tar -Oxmz', 'ab') {|pipe| pipe.write(file.read)} # pass to tar and extract
This necessitates that I reread all the extracted files from the filesystem. I figured out this is the speed bottleneck of my script, and I wonder if I can accomplish the same task in memory. I tried the following:
file = grid.get(record['_id'])
IO.popen('tar -Oxmz', 'w+b') do |pipe|
  pipe.write(file.read)
  pipe.close_write
  output = pipe.read
end
It works, but I get the whole response, including several extracted files, in one piece (in the variable output). I need the files separate from each other, and possibly with their names. Is there any way to do this?
By the way, the resulting files are most of the time text, but sometimes binary. Running a pipe for each output file is not a solution, because the actual overhead of running the commands for each file outweighs the benefit of doing the transformation in memory.
P.S. The actual use case does not rely on tar only. I use software that does not have Ruby wrappers.

JMeter global properties and Simple Data Writers

I am setting a JMeter global property on the command line with the -G option. I try to use this property to alter the file name of a Simple Data Writer. However, in the data writer the __P function returns only the default.
jmeter -t ... --nongui ... -GFileName=MyFile.xml ...
So, I know that I am setting the global property correctly. Both the JMeter log and the JMeter server log show that the value is being captured from the command line. However, it still refuses to write a file name with anything other than the default.
I use the following expression for the filename:
filename_${__P(FileName,Default.fl)}
How do I pass in a value at the command line so that I can use it as the file name for a Simple Data Writer?
Notes: I am using remote servers, so I must use -G, and I already have a primary data file output, so I cannot use -l.
Why not use the -J or -D directives to set your property?
Everything will work as you want in the case of
-JFileName=MyFile.xml
or
-DFileName=MyFile.xml
In both cases you can then refer to this property in the Simple Data Writer as ${__P(FileName,)}.
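For example (the test plan name here is arbitrary), running
jmeter -n -t test_plan.jmx -JFileName=MyFile.xml
with the Simple Data Writer filename set to filename_${__P(FileName,Default.fl)} should produce filename_MyFile.xml.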
Well, I got the same negative result as you while trying to use a global (-G) property, but I cannot find in the situation you described any prerequisite for using global (-G) properties instead of local (-J) or system (-D) ones.
Global properties are defined to be sent to remote servers... are you executing the test in client-server mode (with jmeter-server started)?
Then, as per 18.3.9 Simple Data Writer:
When running in non-GUI mode, the -l flag can be used to create a data file.
I.e. running
jmeter -n -t ... -l MyFile.xml
will give you the same result in MyFile.xml.
As an additional note, you can try the JMeter Plugins solution:
Flexible File Writer - instead of the native Simple Data Writer.

Automatically generate conf file during make

I have a conf file that is of the format:
name=value
What I want to do is, using a template, generate a result based on some values in another file.
So for example, say I have a file called PATHS that contains
CONF_DIR=/etc
BIN_DIR=/usr/sbin
LOG_DIR=/var/log
CACHE_DIR=/home/cache
This PATHS file gets included into a Makefile so that when I call make install the paths are created and the built applications and conf files are copied appropriately.
Now I also have a conf file which I want to use as a template.
Say the template contains lines like
LogFile=$(LOG_DIR)/myapp.log
...
Then generate a destination conf that would have
LogFile=/var/log/myapp.log
...
etc
I think this can be done with a sed script, but I'm not very familiar with sed and regular expression syntax. I will accept a shell script version too.
You should definitely go with autoconf here, whose very job is to do this. You'll have to write a conf.in file, wherein all substitutions are marked with @'s, e.g.
prefix=@prefix@
bindir=@bindir@
and write a configure.ac, from which autoconf generates a configure shell script that will perform these substitutions for you and create conf. conf is subsequently included in the Makefile. I'd even recommend using a Makefile.in file, i.e. including your snippet in the Makefile.
If you keep to the standard path names, your configure.ac is a four-liner and has the added advantage of being GNU compatible (easy to understand & use).
You may want to consider using m4 as a simple template language instead.
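If you do want a plain script instead, as the question allows, here is a minimal sketch of the substitution step written in Python rather than sed (the PATHS name follows the question; conf.template and myapp.conf are assumed file names):

import re

# read name=value pairs from PATHS
paths = {}
with open('PATHS') as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith('#'):
            name, _, value = line.partition('=')
            paths[name.strip()] = value.strip()

# replace $(VAR) references in the template with the values from PATHS
with open('conf.template') as f:
    template = f.read()

conf = re.sub(r'\$\(([A-Za-z_][A-Za-z_0-9]*)\)',
              lambda m: paths.get(m.group(1), m.group(0)),
              template)

with open('myapp.conf', 'w') as f:
    f.write(conf)

An equivalent sed or awk one-liner would follow the same pattern: read the variable definitions, then substitute each $(VAR) occurrence in the template.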
