I want to run some cron schedule task in my app. I have figured out the code, but I'm not sure where to put it. Should I put all code into job file? The code performs such tasks:
Retrieve XML data from 3rd party websites.
Do some filtering and add data to array.
Push data from array to database.
Code is performed every hour.
You can create some kind of a parser for XML data retrieval, and store this class into the lib directory. Other stuff like filtering and pushing data from database - you can implement this directly in the job. As the job grows, you will see what can be moved elsewhere or extended.
Related
I have one use case to create incremental Data ingestion pipeline from one Database to AWS S3. I have created a pipeline and it is working fine except for the one scenario where no incremental data was found.
In case of zero record count, it is writing the file with a header-only (parquet file). I want to skip the target write when there is no incremental record.
How I can implement this in IICS?
I have already tried to implement the router transformation where I have put the condition if record count > 0 then only write to target but still it is not working.
First of all: the target file gets created even before any data is read from source. This is to ensure the process has write access to target location. So even if there will be no data to store, an empty file will get created.
The possible ways out here will be to:
Have a command task check the number of lines in output file and delete it if there is just a header. This would require the file to be created locally, verified, and uploaded to S3 afterwards e.g. using Mass Ingestion task - all invoked sequentially via taskflow
Have a session that will first check if there is any data available, and only then run the data extraction.
I read everywhere how to read data in spring batch itemReader and write in database using itemWriter, but I wanted to just read data using spring batch then somehow I wanted to access this list of items outside the job. I need to perform remaining processing after job finished.
The reason I wanted to do this is because I need to perform a lot of validations on every item. I have to validate each item's variable xyz if it exists in list(which is not available within job). After performing a lot of processing I have to insert information in different tables using JPA. Please help me out!
I created a job batch to extract data from csv file to a jdbc using filejdbc module, it worked properly, but when I scheduled the batch to run every 5 minutes, it did not work with the incremental load concept, it loaded all the data again, Is there any feature to schedule the batch with incremental load?
Is the solution to run the batch once, and to create a stream to do the incremental load? Will the stream load all the data again, or it will just continue from a certain point.
Please explain how can I achieve the incremental load concept using spring XD?
Thanks,
Moha.
I suppose what is missing is the concept of 'state'...the filejdbc module does not seem to know where the last import stopped. I do something similar but I use a custom batch job and I use a meta store to keep track of where the last load stopped - that is where the next incremental will pick up from, etc.
Since you're using a module that came with spring-xd itself, you may not have this flexibility but you may have to options:
a- your destination table can define unique fields that will prevent duplicates. That way, even if its trying to load ALL the data again, only new rows will get inserted. This assumes that the module is using 'insert ignore' (or something similar and not just basic insert (which will throw an error/exception). This, I must say, will end up being non-optimal pretty quickly, pretty soon.
b- If its an option, write a module that can delete the file after its uploaded into the db. You can construct a complex stream that will first do your data load and then file delete.
So basically I have apps on different platforms that are sending logging data to my server. It's a node server that essentially accepts a payload of log entries and it saves them to their respective log files (as write stream buffers, so it is fast), and creates a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based off of reading documentation, I think this format is good for something like Hadoop. The way I think this works, is I will store these logs on a server, then run a cron job that periodically moves these files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon's EMR. From there, I could query it with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions, that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.
If you have a large volume of log dump that changes periodically, the approach you laid out makes sense. Using EMRFS, you can directly process the logs from S3 (which you probably know).
As you 'append' new log events to Hive, the part files will be produced. So, you dont have to concatenate them ahead of loading them to Hive.
(on day 0, the logs are in some delimited form, loaded to Hive, Part files are produced as a result of various transformations. On subsequent cycles, new events/logs will be appened to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data structures/sets and Hive tables and join them. But the joins are going to be slow. So, you may want to define fillers/placeholders in your schema.
If you are going to receive streams of logs (lots of small log files/events) and need to run near real time analytics, then have a look at Kinesis.
(also test drive Impala. It is faster)
.. my 2c.
How can I customize HbaseStorage for pig script? Actually I want to perform some business logic on the data before loading it to the pig script. It would be something like custom storage on top of HbaseStorage.
e.g I've my row key has structure like this A_B_C. Currently, I'm passing A_B_C key in HbaseStorage in my pig script but I want to perform some logic like filtering etc against key like A_B_C_D before serving input data to actual pig script. How is it possible
You may have to end up looking at the HBaseStorage java class and implementing your own classes based on that. Depending on how the HBaseStorage and associated classes have been written, this could vary from being easy (just extend HBaseStorage itself and overwrite where necessary) to a real headache.
You then have to ensure that the .jar containing your code is on the pig classpath.
I find HbaseStorage to be a real pain, so I write regular Java MR jobs to query HBase and create custom sequence files, which I then use from Pig with a simple custom loader. I find this saves a ton of time since the sequence file can be re-used many times throughout the day to get quick results, rather than scanning everything in Hbase for every Pig script.