Spring XD Batch Job Incremental - spring-xd

I created a batch job to extract data from a CSV file into a JDBC database using the filejdbc module, and it worked properly. But when I scheduled the batch to run every 5 minutes, it did not load incrementally; it loaded all the data again. Is there any feature to schedule the batch with an incremental load?
Is the solution to run the batch once and then create a stream to do the incremental load? Will the stream load all the data again, or will it just continue from a certain point?
Please explain how I can achieve incremental loading using Spring XD.
Thanks,
Moha.

I suppose what is missing is the concept of 'state'... the filejdbc module does not seem to know where the last import stopped. I do something similar, but I use a custom batch job with a meta store to keep track of where the last load stopped; that is where the next incremental run picks up from (a minimal sketch of that idea is shown after this answer).
Since you're using a module that comes with Spring XD itself, you may not have this flexibility, but you have two options:
a- Your destination table can define unique fields that prevent duplicates. That way, even if the job tries to load ALL the data again, only new rows get inserted. This assumes the module uses 'insert ignore' (or something similar) rather than a basic insert, which would throw an error/exception. This, I must say, will become non-optimal pretty quickly.
b- If it's an option, write a module that deletes the file after it is uploaded into the database. You can construct a composite stream that first does your data load and then deletes the file.
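For illustration only, here is a minimal sketch of that meta-store idea in plain Java, assuming a hypothetical load_metadata table with source_name and last_line columns and Spring's JdbcTemplate. The stock filejdbc module does not do any of this, so treat it as a starting point for a custom module or job rather than a drop-in fix.

import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical high-water-mark store: remembers how far the last load got
// so the next run can skip everything up to that point.
public class IncrementalLoadTracker {

    private final JdbcTemplate jdbc;

    public IncrementalLoadTracker(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Returns the last line (or offset) loaded for this source, or 0 if the
    // source has never been loaded before.
    public long lastLoadedLine(String sourceName) {
        List<Long> rows = jdbc.queryForList(
                "SELECT last_line FROM load_metadata WHERE source_name = ?",
                Long.class, sourceName);
        return rows.isEmpty() ? 0L : rows.get(0);
    }

    // Records how far the current run got; the next incremental run resumes here.
    public void markLoadedUpTo(String sourceName, long lastLine) {
        int updated = jdbc.update(
                "UPDATE load_metadata SET last_line = ? WHERE source_name = ?",
                lastLine, sourceName);
        if (updated == 0) {
            jdbc.update(
                    "INSERT INTO load_metadata (source_name, last_line) VALUES (?, ?)",
                    sourceName, lastLine);
        }
    }
}

A scheduled job would then call lastLoadedLine() at the start of a run, skip everything up to that point, and call markLoadedUpTo() once the new rows are committed.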

Related

How to wait for GenerateTableFetch queries to finish

My use case is like this: I have some X tables to be pulled from MySQL. I am splitting them using SplitText to put each table in an individual flow file, and pulling them using GenerateTableFetch and ExecuteSQL.
I want to be notified, or to trigger some other action, when the import is done for all the tables. At the SplitText processor I have routed the original relationship to a Wait processor on ${filename} with target count ${fragment.count}. This tracks how many tables are done.
But now I am not able to figure out how to know when a particular table is done. GenerateTableFetch forks a flow file into multiple flow files based on Partition Size, but it does not write attributes like fragment.count that I could wait on for each table.
Is there a way I can achieve this? Or is there a way to know, at the end of the entire flow, whether all the flow files have been processed and nothing is queued or still being processed?
If you have a standalone instance of NiFi (or are not distributing the flow files among a cluster to ExecuteSQL nodes), then you could use QueryDatabaseTable instead; by default it will only issue all flow files once the entire result set is processed. If you have all the rows go into a single flow file, then the fact that the flow file has been transferred downstream is an indication that the fetch is complete.
I have written NIFI-5601 to cover the improvement of adding fragment.* attributes to flow files generated by GTF.
Until NiFi adds support for this, I managed to make it work using MergeContent: use table_name as the Correlation Attribute Name, and then route the merged relationship to the Wait processor with ${merge.count} as the target. Refer to the screenshots if you are looking to do the same.

read data through spring batch and return data outside the job

I have read everywhere how to read data with a Spring Batch ItemReader and write it to a database with an ItemWriter, but I want to just read data using Spring Batch and then somehow access that list of items outside the job. I need to perform the remaining processing after the job has finished.
The reason I want to do this is that I need to perform a lot of validations on every item. I have to check each item's variable xyz against a list (which is not available within the job). After performing a lot of processing I have to insert information into different tables using JPA. Please help me out!
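Purely as an illustrative sketch (none of this is from the thread): one common way to get the read items back out is to wire the step with Spring Batch's ListItemWriter, so the job only reads and collects, and then pull the collected items from the writer bean once the JobExecution has finished. The job name and the generic item type below are placeholders.

import java.util.List;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.item.support.ListItemWriter;

// Illustrative only: the step in "importJob" pairs whatever reader applies
// with this ListItemWriter, so the job does nothing but read and collect.
public class ReadOnlyJobRunner<T> {

    private final JobLauncher jobLauncher;
    private final Job importJob;                // placeholder job bean
    private final ListItemWriter<T> listWriter; // shared writer bean that collects items

    public ReadOnlyJobRunner(JobLauncher jobLauncher, Job importJob,
                             ListItemWriter<T> listWriter) {
        this.jobLauncher = jobLauncher;
        this.importJob = importJob;
        this.listWriter = listWriter;
    }

    public List<? extends T> readItems() throws Exception {
        // Unique parameter so the job instance can be launched repeatedly.
        JobParameters params = new JobParametersBuilder()
                .addLong("run.ts", System.currentTimeMillis())
                .toJobParameters();
        JobExecution execution = jobLauncher.run(importJob, params);
        if (execution.getStatus().isUnsuccessful()) {
            throw new IllegalStateException("Import job failed: " + execution.getStatus());
        }
        // Everything the step read is now available to ordinary code.
        return listWriter.getWrittenItems();
    }
}

After readItems() returns, the validations against the external list and the JPA inserts can happen in ordinary application code, outside the job.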

Would spark dataframe read from external source on every action?

In the Spark shell I use the code below to read from a CSV file:
val df = spark.read.format("org.apache.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").csv("/opt/person.csv") //spark here is the spark session
df.show()
Assuming this displays 10 rows: if I add a new row to the CSV by editing it, would calling df.show() again show the new row? If so, does that mean the dataframe reads from an external source (in this case a CSV file) on every action?
Note that I am neither caching the dataframe nor recreating it using the Spark session.
After each action, Spark forgets about the loaded data and any intermediate values you used in between.
So if you invoke 4 actions one after another, it computes everything from the beginning each time.
The reason is simple: Spark works by building a DAG, which lets it lay out the path of operations from reading the data through to the action, and then it executes it.
That is why cache and broadcast variables exist. The onus is on the developer to cache when they know they are going to reuse that data or dataframe N times.
TL;DR DataFrame is not different than RDD. You can expect that the same rules apply.
With a simple plan like this, the answer is yes. It will read the data for every show, although if the action doesn't require all the data (like here) it won't read the complete file.
In the general case (complex execution plans), the data can be accessed from the shuffle files.
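To make the caching point above concrete, here is a small sketch using Spark's Java Dataset API (the same calls exist in the Scala shell); the path and options simply mirror the ones in the question.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvRereadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-reread-example")
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("mode", "DROPMALFORMED")
                .csv("/opt/person.csv");

        df.show();          // reads the CSV (or as much of it as show() needs)
        df.show();          // no cache, so the plan is re-executed against the file

        Dataset<Row> cached = df.cache();
        cached.count();     // materializes the cache by reading the file once
        cached.show();      // served from the cached data, not from the CSV on disk

        spark.stop();
    }
}

Note that cache() is lazy: nothing is actually stored until the next action runs.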

RoR schedule job

I want to run a scheduled cron task in my app. I have figured out the code, but I'm not sure where to put it. Should I put all the code into the job file? The code performs the following tasks:
Retrieve XML data from 3rd-party websites.
Do some filtering and add the data to an array.
Push the data from the array to the database.
The code runs every hour.
You can create some kind of parser for the XML data retrieval and store that class in the lib directory. Other things, like filtering and pushing data to the database, you can implement directly in the job. As the job grows, you will see what can be moved elsewhere or extended.

What's the proper way to log big data to organize and store it with Hadoop, and query it using Hive?

So basically I have apps on different platforms that are sending logging data to my server. It's a Node server that essentially accepts a payload of log entries, saves them to their respective log files (as buffered write streams, so it is fast), and creates a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space-separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based on the documentation I've read, I think this format is a good fit for something like Hadoop. The way I think this works is that I store these logs on a server, then run a cron job that periodically moves the files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon's EMR, and from there I could query them with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions; that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.
If you have a large volume of log dumps that changes periodically, the approach you laid out makes sense. Using EMRFS, you can process the logs directly from S3 (which you probably know).
As you 'append' new log events to Hive, part files will be produced, so you don't have to concatenate them before loading them into Hive.
(On day 0, the logs are in some delimited form and get loaded into Hive; part files are produced as a result of various transformations. On subsequent cycles, new events/logs will be appended to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data sets and Hive tables and join them, but the joins are going to be slow, so you may want to define fillers/placeholders in your schema.
If you are going to receive streams of logs (lots of small log files/events) and need to run near-real-time analytics, then have a look at Kinesis.
(Also test-drive Impala; it is faster.)
.. my 2c.
