How to use Airflow for real-time data processing (ETL)

I have a scenario where I want to process a CSV file and load it into some other database:
Cases
- pick up the CSV file and load it into MySQL, into a table with the same name as the CSV
- then do some modification on the loaded rows using a Python task
- after that, extract the data from MySQL and load it into some other database
The CSV files arrive from a remote server into a folder on the Airflow server.
We have to pick up these CSV files and process them through a Python script.
Suppose I pick one CSV file; I then need to pass this CSV file to the rest of the operators in a dependency chain like:
filename: abc.csv
task1 >> task2 >> task3 >> task4
So abc.csv should be available to all the tasks.
Please tell me how to proceed.

Your scenario doesn't have anything to do with real time; this is ingesting on a schedule/interval. Alternatively, you could use a sensor operator (e.g. a FileSensor) to detect data availability.
Implement each of your requirements as a function and call them from operator instances.
Add the operators to a DAG with a schedule appropriate for your incoming feed.
How you pass and access params:
- kwargs to the python_callable when initializing an operator
- context['param_key'] in the execute method when extending an operator
- Jinja templates
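A minimal sketch of that structure, assuming Airflow 1.x; the folder, file name and function bodies are placeholders, not your actual logic. The file name is handed to each task via op_kwargs, and the last task also shows a Jinja-templated value passed through templates_dict:

import os
from datetime import datetime

from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.python_operator import PythonOperator

INCOMING_DIR = '/data/incoming'   # assumption: folder where the remote server drops the CSVs
CSV_NAME = 'abc.csv'              # assumption: known file name
csv_path = os.path.join(INCOMING_DIR, CSV_NAME)

def load_csv_to_mysql(csv_path, **context):
    print('loading', csv_path, 'into a MySQL table named after the file')

def transform_rows(csv_path, **context):
    print('modifying the rows loaded from', csv_path)

def export_to_other_db(csv_path, templates_dict=None, **context):
    # templates_dict['run_date'] is rendered by Jinja at run time
    print('exporting', csv_path, 'to the other database for run', templates_dict['run_date'])

dag = DAG('csv_etl', schedule_interval='@hourly', start_date=datetime(2019, 1, 1))

wait_for_file = FileSensor(task_id='wait_for_file', filepath=csv_path, dag=dag)
task1 = PythonOperator(task_id='load', python_callable=load_csv_to_mysql,
                       op_kwargs={'csv_path': csv_path}, provide_context=True, dag=dag)
task2 = PythonOperator(task_id='transform', python_callable=transform_rows,
                       op_kwargs={'csv_path': csv_path}, provide_context=True, dag=dag)
task3 = PythonOperator(task_id='export', python_callable=export_to_other_db,
                       op_kwargs={'csv_path': csv_path},
                       templates_dict={'run_date': '{{ ds }}'},
                       provide_context=True, dag=dag)

wait_for_file >> task1 >> task2 >> task3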
Relevant:
airflow pass parameter from cli
execution_date in airflow: need to access as a variable

The way tasks communicate in Airflow is via XCom, but XCom is meant for small values, not for file content.
If you want your tasks to work with the same CSV file, you should save it to some location and then pass the path to that location via XCom.
We are using the LocalExecutor, so the local file system is fine for us.
We decided to create a folder for each DAG, named after the DAG. Inside that folder we generate a folder for each execution date (we do this in the first task, which we always call start_task). Then we pass the path of this folder to the subsequent tasks via XCom.
Example code for the start_task:
import os

from airflow.operators.python_operator import PythonOperator

# DATE_FORMAT, share_path, dag and _create_folder_delete_if_exists are defined elsewhere in our DAG file
def start(share_path, **context):
    execution_date_as_string = context['execution_date'].strftime(DATE_FORMAT)
    execution_folder_path = os.path.join(share_path, 'my_dag_name', execution_date_as_string)
    _create_folder_delete_if_exists(execution_folder_path)
    task_instance = context['task_instance']
    task_instance.xcom_push(key="execution_folder_path", value=execution_folder_path)

start_task = PythonOperator(
    task_id='start_task',
    provide_context=True,
    python_callable=start,
    op_args=[share_path],
    dag=dag
)
The share_path is the base directory for all DAGs; we keep it in an Airflow Variable.
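For example (assuming the variable key is also named share_path):

from airflow.models import Variable

share_path = Variable.get("share_path")   # base directory shared by all DAGs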
Subsequent tasks can get the execution folder with:
execution_folder_path = task_instance.xcom_pull(task_ids='start_task', key='execution_folder_path')
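Putting it together, a downstream task might look like this (a sketch reusing the dag and imports from the start_task example; the processing logic is a placeholder):

def process(**context):
    task_instance = context['task_instance']
    execution_folder_path = task_instance.xcom_pull(task_ids='start_task',
                                                    key='execution_folder_path')
    # work with whatever files were placed in the execution folder
    for file_name in os.listdir(execution_folder_path):
        print('processing', os.path.join(execution_folder_path, file_name))

process_task = PythonOperator(
    task_id='process_task',
    provide_context=True,
    python_callable=process,
    dag=dag
)

start_task >> process_task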

Related

Can we rename the file that will be uploaded in a JMeter script at run time?

I have to upload audio files stored on my system in a JMeter script. E.g. abc.wav is the file stored on the system, but in the script the file name format should be "Testinstanceid__itemid__ interactionid.wav", where "Testinstanceid" is a dynamic value that we get by correlating a previous response.
How can I upload the file with this dynamic value at run time so that it uploads correctly in the script?
Thanks in advance.
You can copy your abc.wav file to the Testinstanceid__itemid__ interactionid.wav file using a JSR223 PreProcessor and a Groovy script like:
org.apache.commons.io.FileUtils.copyFile(new File('abc.wav'), new File(vars.get('Testinstanceid') + '__itemid__ interactionid.wav'))
Once you finish the upload you can delete the file to free up your drive in a JSR223 PostProcessor like:
org.apache.commons.io.FileUtils.deleteQuietly(new File(vars.get('Testinstanceid') + '__itemid__ interactionid.wav'))
where vars stands for the JMeterVariables class instance; check out its JavaDoc for all available functions.

Informatica post-session command task

I am working with multiple source files with a single source instance. I created three flat files and one destination table to experiment with multiple sources. I am using the ‘File list’ concept; for that I created a text file which contains all the flat file names.
Example:
Filename : File_list.txt
File content : Price1.txt
Price2.txt
Price3.txt
In the above example, Price1.txt, Price2.txt and Price3.txt are flat file names. I specified File_list.txt as the source file while running the workflow in Informatica, so it iterates through all the flat files listed in File_list.txt and inserts all the values into the destination table.
Now, once the data is inserted into the destination, I need to delete those source files from that directory location.
How can I achieve this?
You'll need to write a custom script that uses File_list.txt as input and performs the delete operations. You can then call it using the Post-Session Success Command session component, or as a separate Command Task in the workflow, linked using a $YourSessionName.Status = SUCCEEDED condition.
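A minimal sketch of such a cleanup script (Python here purely for illustration; a shell script works just as well — the file-list path and source directory are passed in as arguments and are assumptions, not Informatica-provided values):

import os
import sys

# usage: python delete_sources.py <path to File_list.txt> <source directory>
file_list_path, source_dir = sys.argv[1], sys.argv[2]

with open(file_list_path) as file_list:
    for line in file_list:
        name = line.strip()
        if not name:
            continue
        flat_file = os.path.join(source_dir, name)
        if os.path.isfile(flat_file):
            os.remove(flat_file)
            print('deleted', flat_file)

In the Post-Session Success Command you would then call something along the lines of python delete_sources.py $PMSourceFileDir/File_list.txt $PMSourceFileDir (using $PMSourceFileDir is an assumption about where your source files live).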

Working with zips in pyspark

I have n zips in a directory and I want to extract each one of them, then pull some data out of a file or two inside each zip and add it to a graph DB. I have written a sequential Python script for this whole thing, but I am stuck at converting it for Spark. All of my zips are in an HDFS directory, and the graph DB is Neo4j. I have yet to learn about connecting Spark with Neo4j, but I am stuck at a more initial step.
I am thinking my code should be along these lines:
# Names of all my zips
zip_names = ["a.zip", "b.zip", "c.zip"]
# function extract_and_populate_graphDB() returns 1 after doing all the work.
# This was done so that an action can be applied to start the Spark job.
sc.parallelize(zip_names).map(extract_and_populate_graphDB).reduce(lambda a, b: a + b)
What I can't figure out is how to extract the zips and read the files within them. I was able to read a zip with sc.textFile, but running take(1) on it returned hex data.
So, is it possible to read in a zip and extract the data? Or should I extract the data before putting it into HDFS? Or maybe there's some other approach to deal with this?
Updated answer:
If you'd like to use Gzip compressed files, there are parameters you can set when you configure your Spark shell or Spark job that allow you to read and write compressed data.
--conf spark.hadoop.mapred.output.compress=true \
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
--conf spark.hadoop.mapred.output.compression.type=BLOCK
Add those to the bash script you are currently using to create a shell (e.g. pyspark) and you can read and write compressed data.
Unfortunately, there is no innate support for Zip files, so you'll need to do a bit more legwork to get there.
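For Zip archives specifically, one workaround (a sketch, untested against your data; the HDFS path is an assumption) is to read each archive whole with binaryFiles and unpack it in memory with Python's zipfile module, which also fits the parallelize/map structure sketched in the question:

import io
import zipfile

def extract_and_populate_graphDB(archive):
    path, content = archive          # binaryFiles yields (file path, file bytes) pairs
    processed = 0
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            # ... parse `data` and write to Neo4j here ...
            processed += 1
    return processed

# each zip stays intact (not split into text lines), but it must fit in executor memory
counts = (sc.binaryFiles("hdfs:///path/to/zips/*.zip")
            .map(extract_and_populate_graphDB)
            .reduce(lambda a, b: a + b))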

Oozie generate set of files in directory

I'm trying to ingest log files into Hadoop.
I'd like to use Oozie to trigger my ingestion task (written in Spark), and have Oozie pass the file names to my task.
I expect the log files to be set out as:
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.2.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.2.log
(etc).
So now I have two problems:
1. How to get Oozie to generate all the file names under /example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/ and pass them to my app; and
2. How to get Oozie to generate, in parallel, all the file names under /example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/ and pass them to a second invocation of my task.
Datetime-based file names can be generated by a small Java program, which can be called from the Oozie workflow.xml, something like:
String processedDateString = (new SimpleDateFormat("yyyyMMddhhmmss")).format(new Date(timeInMilis));
and when calling the same jar in the workflow:
<main-class>NameFile.jar</main-class>
<arg>Path=${output_path}</arg>
<arg>Name=${name}</arg>
<arg>processedDate=${(wf:actionData('Rename')['ProcessedDate'])}</arg>
For copying/moving you can use the same Java program with a copy action.
The Log1 and Log2 locations can be specified in job.properties.

Writing to a file named with variables in distributed JMeter testing

Okay I've been having an issue with writing results to folders in JMeter.
I have set 2 variables, one for the name of the test and one for the submit date. I want the reports to be written to the folder named with these two variables.
Here are the variables:
TestRun = "Name of test"
DateRun = ${__time(dd-MMM-yyyy HH.mm.ss)}
The path of the folder to be written to looks like this:
C:\Tests\TestEnvironment\Results\${TestRun}${DateRun}\file.csv
When I run it on the master machine, it's fine. It saves to the correct file and folder path, and ends up looking something like this:
C:\Tests\TestEnvironment\Results\Test Run 1 - 08-May-2014 08.55.47\file.csv
However, when I run it on remote machines, it saves it literally as below:
C:\Tests\TestEnvironment\Results\${TestRun}${DateRun}\file.csv
So I end up with a folder named "${TestRun}${DateRun}"
Am I missing something blindingly obvious, or is this an actual JMeter issue?
Thanks!
As per JMeter help:
-G, --globalproperty <argument>=<value>
Define Global properties (sent to servers)
e.g. -Gport=123
or -Gglobal.properties
You need to use the -G key so your variables can be distributed across the remote clients.
so something like:
jmeter -r -n -GTestRun=SomeName -GDateRun=SomeTime -t /path/to/your/plan
should help.
Alternatively you can create a .properties file and pass it to remote JMeter Engines via the same "-G" option.
I expect that if you want to use the JMeter __time() function you'll need to wrap it in __eval(), otherwise it will be treated as a string. Alternatively, you can use operating system commands to retrieve the current date and time.
See Apache JMeter Properties Customization Guide for more information on dealing with JMeter Properties.
