I need an Airflow DAG to do something like this:
download_file >> parse_file >> enrich_file
But if, say, while parsing I find "bad" records, I need to store them in a DB, have the user correct them, and then automatically trigger a DAG to proceed to enrich_file for those corrected records. For the "good" records I want the original instance of the DAG to complete. Is it possible to do this kind of forking in Airflow? Is this even the right tool for this use case, or should I stick to messaging and queues?
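For reference, a minimal sketch of the kind of forking being described, assuming Airflow 2.4+; the task names, the XCom key, and the branch logic are placeholders for illustration, not a complete solution:

```python
# Sketch only: branch after parse_file so good records continue to enrich_file
# while a side task records that corrections are pending. All names are made up.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.empty import EmptyOperator


def _download_file(**_):
    ...  # fetch the file


def _parse_file(ti, **_):
    # Parse, write "bad" records to a DB table for manual correction,
    # and pass the "good" records on.
    bad_records_found = False  # placeholder for real parsing logic
    ti.xcom_push(key="bad_records_found", value=bad_records_found)


def _choose_branch(ti, **_):
    # Good records always continue to enrich_file; the branch only decides
    # whether we also flag that corrections are pending.
    if ti.xcom_pull(task_ids="parse_file", key="bad_records_found"):
        return ["enrich_file", "flag_pending_corrections"]
    return ["enrich_file"]


with DAG("ingest", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    download_file = PythonOperator(task_id="download_file", python_callable=_download_file)
    parse_file = PythonOperator(task_id="parse_file", python_callable=_parse_file)
    branch = BranchPythonOperator(task_id="branch", python_callable=_choose_branch)
    enrich_file = EmptyOperator(task_id="enrich_file")
    flag_pending_corrections = EmptyOperator(task_id="flag_pending_corrections")

    download_file >> parse_file >> branch >> [enrich_file, flag_pending_corrections]
```

A second DAG that picks up the corrected records from the DB can then be triggered from outside Airflow once the user has fixed them (for example with the `airflow dags trigger` CLI or the REST API), while the original run completes normally for the good records.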
My use case is like this: I have X tables to be pulled from MySQL. I am splitting them using SplitText to put each table in an individual flow file, and pulling with GenerateTableFetch and ExecuteSQL.
I want to be notified, or take some other action, when the import is done for all the tables. At the SplitText processor I have routed the original relationship to a Wait processor on ${filename} with target count ${fragment.count}. This tracks how many tables are done.
But now I am not able to figure out how to know when a particular table is done. GenerateTableFetch forks the flow file into multiple flow files based on Partition Size, but it does not write attributes like fragment.count that I could wait on for each table.
Is there a way I can achieve this? Or, alternatively, is there a way to know at the end of the entire flow that all flow files have been processed and nothing is queued or still being processed?
If you have a standalone instance of NiFi (or are not distributing the flow files among ExecuteSQL nodes in a cluster), then you could use QueryDatabaseTable instead; by default it only issues the flow files once the entire result set has been processed. If you have all the rows go into a single flow file, then the fact that the flow file has been transferred downstream is an indication that the fetch is complete.
I have written NIFI-5601 to cover the improvement of adding fragment.* attributes to flow files generated by GTF.
Until NiFi adds support for this, I managed to make it work using MergeContent: use table_name as the Correlation Attribute Name and then route the merged relationship to the Wait processor with ${merge.count} as the target. Refer to the screenshots if someone is looking to do the same.
Different types of files arrive in my NFS directory from time to time. I would like to use Oozie or any other Hadoop solution to trigger on the file-arrival event and copy the file into a specific location in HDFS according to its type. What is the best way to do this?
"Best way" is a very subjective term. It largely depends on the kind of data, the frequency, and what should happen once the data arrives at the specific location.
Apache Flume can monitor a specific folder for data availability and push it down to any sink, such as HDFS, as-is. Flume is good for streaming data, but it does only one specific job: moving data from place to place.
On the other hand, look at Oozie coordinators. Coordinators have a data-availability trigger, and with Oozie you can perform all sorts of ETL operations after the data arrives, using tools like Spark, Hive, Pig, etc., and push it down to HDFS using shell actions. You can schedule jobs to run at specific times or frequencies, or have the job send you an email if something goes wrong.
I created a batch job to extract data from a CSV file to JDBC using the filejdbc module, and it worked properly. But when I scheduled the batch to run every 5 minutes, it did not load incrementally; it loaded all the data again. Is there any feature to schedule the batch with an incremental load?
Is the solution to run the batch once and create a stream to do the incremental load? Would the stream load all the data again, or would it just continue from a certain point?
Please explain how I can achieve incremental loads using Spring XD.
Thanks,
Moha.
I suppose what is missing is the concept of 'state': the filejdbc module does not seem to know where the last import stopped. I do something similar, but I use a custom batch job with a metadata store to keep track of where the last load stopped; that is where the next incremental load picks up from.
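To illustrate the 'state' idea (this is not Spring XD code; the table and column names etl_state, source_rows, and last_loaded_id are invented), a minimal Python sketch of such a metadata store:

```python
# Sketch of incremental loading via a tiny metadata ("state") store.
# Uses sqlite3 so the example is self-contained; a real job would point
# the same watermark logic at its actual source and target databases.
import sqlite3


def get_watermark(conn, job_name):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_state (job_name TEXT PRIMARY KEY, last_loaded_id INTEGER)"
    )
    row = conn.execute(
        "SELECT last_loaded_id FROM etl_state WHERE job_name = ?", (job_name,)
    ).fetchone()
    return row[0] if row else 0


def set_watermark(conn, job_name, last_id):
    conn.execute(
        "INSERT OR REPLACE INTO etl_state (job_name, last_loaded_id) VALUES (?, ?)",
        (job_name, last_id),
    )
    conn.commit()


def incremental_load(conn, job_name="csv_to_jdbc"):
    watermark = get_watermark(conn, job_name)
    # Only pick up rows the previous run has not loaded yet.
    rows = conn.execute(
        "SELECT id, payload FROM source_rows WHERE id > ? ORDER BY id", (watermark,)
    ).fetchall()
    for row_id, payload in rows:
        ...  # write the row to the destination
    if rows:
        set_watermark(conn, job_name, rows[-1][0])
    return len(rows)
```

A real module would keep the same watermark somewhere durable (a control table in the target database is common), but the pattern is the same: read the last position, load only what comes after it, then advance the position.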
Since you're using a module that comes with Spring XD itself, you may not have this flexibility, but you have two options:
a) Your destination table can define unique fields that prevent duplicates. That way, even if it tries to load ALL the data again, only new rows get inserted. This assumes the module is using 'insert ignore' (or something similar) and not just a basic insert (which would throw an error/exception). This, I must say, will end up being non-optimal pretty quickly.
b) If it's an option, write a module that can delete the file after it is uploaded into the DB. You can construct a complex stream that first does your data load and then the file delete.
I want to know whether I can compare two consecutive jobs in Hadoop; I would appreciate it if anyone could tell me how to proceed with that. To be precise, I want to compare the jobs in terms of what exactly the two jobs did. The reason for doing this is to build statistics about how many jobs executed on Hadoop were similar in behavior, for example how many times the same sorting function was executed on the same input.
For example, say the first job did something like SortList(A) and some other job did SortList(A) + Group(result(SortList(A))). I am wondering whether Hadoop stores some mapping somewhere, like JobID X -> SortList(A).
So far, I have approached this problem by looking for the entry point in Hadoop and trying to understand how a job is created and what information is kept with a job ID, and in what form (code or some description), but I was not able to figure it out.
Hadoop's Counters might be a good place to start. You can define your own counter names (e.g., each counter name is a data set you are working on) and increment that counter each time you perform a sort on it. Finding out which data set you are working on, however, may be the more difficult task.
Here's a tutorial I found:
http://philippeadjiman.com/blog/2010/01/07/hadoop-tutorial-series-issue-3-counters-in-action/
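If the jobs are written as (or can be wrapped in) Hadoop Streaming jobs, counters can be bumped from Python by writing the reporter protocol to stderr. A minimal sketch, with the counter group and name invented for illustration:

```python
#!/usr/bin/env python3
# Hadoop Streaming mapper sketch: streaming tasks update counters by writing
# "reporter:counter:<group>,<counter>,<amount>" lines to stderr.
# The group/counter names ("JobBehavior", "SortList_A") are made up here.
import sys


def bump_counter(group, counter, amount=1):
    sys.stderr.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))


for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    # ... whatever the job actually does with the record ...
    if key:  # e.g. count every time this mapper touched data set A
        bump_counter("JobBehavior", "SortList_A")
    print("%s\t%s" % (key, value))
```

After the job finishes, these custom counters show up in the job history alongside the built-in counters, so they can be read back per job ID and compared across jobs.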
No. Hadoop jobs are just programs; they can have any side effects. They can write ordinary files, HDFS files, or to a database. Nothing in Hadoop records all of their activities; all Hadoop does is manage the scheduling and the flow of data.
I have a requirement to parse both Apache access logs and Tomcat logs one after another using MapReduce. A few fields are extracted from the Tomcat log and the rest from the Apache log. I need to merge/map the extracted fields based on the timestamp and export these mapped fields into a traditional relational DB (e.g., MySQL).
I can parse and extract the information using regular expressions or Pig. The challenge I am facing is how to map the extracted information from both logs into a single aggregate format or file, and how to export this data to MySQL.
A few approaches I am thinking of:
1) Write the MapReduce output from both the parsed Apache access logs and the Tomcat logs into separate files and merge those into a single file (again based on timestamp). Export this data to MySQL.
2) Use HBase or Hive to store the data in table format in Hadoop and export that to MySQL.
3) Directly write the output of MapReduce to MySQL using JDBC.
Which approach would be most viable? Please also suggest any other alternative solutions you know of.
It's almost always preferable to have smaller, simpler MR jobs chained together than to have large, complex jobs. I think your best option is to go with something like #1. In other words:
1. Process the Apache httpd logs into a unified format.
2. Process the Tomcat logs into a unified format.
3. Join the output of 1 and 2 using whatever logic makes sense, writing the result in the same format.
4. Export the resulting dataset to your database.
You can probably perform the transforms (1 and 2) and the join in the same job: use the map side to transform and do a reduce-side join.
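As a rough illustration only (the record layout and the apache/tomcat source tags are assumptions, not your actual format), a reduce-side join for Hadoop Streaming keyed on timestamp might look like this:

```python
#!/usr/bin/env python3
# Sketch of a reduce-side join for Hadoop Streaming. Assumes the mappers
# already emitted lines as "timestamp<TAB>source<TAB>fields", where source
# is either "apache" or "tomcat" (these tags are invented here), and that
# Hadoop has sorted the lines by timestamp before the reducer runs.
import sys
from itertools import groupby


def parse(stdin):
    for line in stdin:
        ts, source, fields = line.rstrip("\n").split("\t", 2)
        yield ts, source, fields


def main():
    for ts, group in groupby(parse(sys.stdin), key=lambda rec: rec[0]):
        apache_fields, tomcat_fields = [], []
        for _, source, fields in group:
            (apache_fields if source == "apache" else tomcat_fields).append(fields)
        # Emit one unified record per timestamp; unmatched sides come out empty.
        print("\t".join([ts, ";".join(apache_fields), ";".join(tomcat_fields)]))


if __name__ == "__main__":
    main()
```

The unified output can then be exported in bulk (step 4) instead of having the reducers write rows to the database themselves.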
It doesn't sound like you need or want the overhead of random access, so I wouldn't look at HBase. Random access isn't its strong point (although you could do it that way by looking up each record in HBase by timestamp, seeing if it exists, merging the record in, or simply inserting if it doesn't exist, but that is comparatively very slow). Hive could be convenient for storing the "unified" result of the two formats, but you'd still have to transform the records into that format.
You absolutely do not want the reducers writing to MySQL directly. That effectively creates a DDoS attack on the database: consider a cluster of 10 nodes, each running 5 reducers, and you have 50 concurrent writers to the same table. As you grow the cluster you'll exceed max connections very quickly and choke the RDBMS.
All of that said, ask yourself whether it makes sense to put this much data into the database if you're considering the full log records. This amount of data is precisely the kind of case Hadoop itself is meant to store and process long term. If you're computing aggregates of this data, by all means, toss them into MySQL.
Hope this helps.