I have two Spring Batch jobs:
1) Fetch data from the DB and create a CSV file of the fetched data (runs every 15 minutes).
2) Index the created CSV file into Solr (runs every 10 minutes).
These two jobs run in parallel as cron jobs. I want to add a partitioner to the CSV-to-Solr job for faster processing.
The problem is that sometimes the CSV-to-Solr job picks up a CSV that the DB-to-CSV job is still writing. So I want to pick up only CSVs that are fully loaded with data and hand those files to the partitioner.
Please tell me the way forward.
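One common pattern (this is only a sketch of one possible approach; the paths and listener name are placeholders, not from your setup) is to have the DB-to-CSV job write to a temporary name and rename the file only after the job completes, so the CSV-to-Solr job and its partitioner only ever match finished files:

import java.nio.file.*;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.listener.JobExecutionListenerSupport;

// Attach this listener to the DB-to-CSV job. The writer targets "products.csv.tmp";
// only when the job finishes successfully is the file renamed to "products.csv".
// The CSV-to-Solr job's partitioner matches only "*.csv", so it never sees a
// half-written file.
public class PublishCsvWhenCompleteListener extends JobExecutionListenerSupport {

    private final Path tempFile  = Paths.get("/data/export/products.csv.tmp"); // placeholder path
    private final Path finalFile = Paths.get("/data/export/products.csv");     // placeholder path

    @Override
    public void afterJob(JobExecution jobExecution) {
        if (jobExecution.getStatus() == BatchStatus.COMPLETED) {
            try {
                // A same-filesystem rename is effectively atomic, so readers see
                // either no file at all or a complete one.
                Files.move(tempFile, finalFile, StandardCopyOption.ATOMIC_MOVE);
            } catch (Exception e) {
                throw new IllegalStateException("Could not publish CSV file", e);
            }
        }
    }
}

The CSV-to-Solr job's MultiResourcePartitioner (or whatever partitioner you use) can then be given only *.csv resources and will never receive an in-progress file.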
I have two different files with different data. Each file contains 10K records per day.
Ex:
Yesterday's file:
Productname,price,date
T shirt,500,051221
Pant,1000,051221
Today's file:
Productname,price,date
T shirt,800,061221
Pant,1800,061221
I want to create a final output file by computing the price difference between today's and yesterday's files.
Ex:
Productname,price
T shirt,300
Pant,800
I have to do this using Spring Batch.
I have tried a batch configuration with two different steps, but it is only able to read the data and is unable to do the processing, because I need the data from both files for the processing, whereas in my setup the steps read one file after the other.
Could anyone help me with this with some sample code?
I would suggest saving the flat-file data into the database for both yesterday's and today's dates (maybe in two separate tables, or in the same table if you can easily tell the two sets of records apart). Read this stored data using a JdbcCursorItemReader or a PagingItemReader, perform the calculation/logic/massaging of the data at the processor level, and create a new flat file or save into the DB as convenient. Out of the box, Spring Batch does not provide a facility to read the data from both files and perform this calculation.
Alternative suggestion: read the data from both flat files, keep it in a cache, and then read from the cache and do the further processing.
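To make the cache suggestion concrete, here is a minimal standalone Java sketch (the file names yesterday.csv, today.csv and price-diff.csv are placeholders, and it assumes the comma-separated layout shown in the question). In a real Spring Batch job you could build the yesterday map in a first step or a StepExecutionListener and do the subtraction in the ItemProcessor that handles today's file:

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class PriceDiff {

    // Parse "Productname,price,date" lines into a product -> price map.
    static Map<String, Integer> loadPrices(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.skip(1)                              // skip the header row
                    .map(line -> line.split(","))
                    .filter(cols -> cols.length >= 2)
                    .collect(Collectors.toMap(
                            cols -> cols[0].trim(),
                            cols -> Integer.parseInt(cols[1].trim())));
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> yesterday = loadPrices(Paths.get("yesterday.csv"));
        Map<String, Integer> today     = loadPrices(Paths.get("today.csv"));

        // Emit "Productname,priceDifference" for every product present in both files.
        List<String> output = new ArrayList<>();
        output.add("Productname,price");
        today.forEach((product, price) -> {
            Integer previous = yesterday.get(product);
            if (previous != null) {
                output.add(product + "," + (price - previous));
            }
        });

        Files.write(Paths.get("price-diff.csv"), output);
    }
}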
I have Spark code that appends data from a Hive table to Parquet files partitioned by date. The code runs absolutely correctly when executed from the spark shell, and the Parquet files show exactly the same number of rows as the Hive table for the corresponding date.
However, when the same code is packaged into a jar file and invoked by a spark-submit command that NiFi schedules to run daily at 9 AM, the number of rows in the Parquet partition files comes out lower. We are on the P_NO_SLA queue, and below are some of the facts and observations we have:
• Data in the source Hive table gets updated by approximately 4 AM.
• Initially our NiFi job was scheduled to start running at 4:45 AM, but the number of records did not match. On doing a manual run from the spark shell after 6 AM, the data was an exact match.
• Hence, we rescheduled the job to run at 7 AM. After doing this, on days when the number of records was small (approx. 20,000 on weekends) compared to weekdays (in the range of 150,000 to >200,000 records), the data got updated correctly via the NiFi job. Again, a manual run was done to backfill the missing data.
• Again, we postponed the job to 9 AM. After doing this, there were 2 days when the number of records matched (between 160,000 and 200,000); however, since Jul-31 the data hasn't matched at all, irrespective of the number of records on any of the days, and we are having to do a manual backfill every day.
We are unable to figure out any specific reason why the code runs correctly from the spark shell at any time, but gives incorrect results when NiFi is merely scheduled to execute the spark-submit command that runs the jar file containing the same Spark code.
Please help me understand why this is happening and how I can fix it.
P.S.: I have checked the NiFi log files and could not find any of the scheduled jobs reporting an error.
I've created a simple NiFi pipeline that reads a stream of data from a Kafka topic (using ConsumeKafka) and writes it to HDFS (using PutHDFS). Currently, I'm seeing lots of small files being created in HDFS. A new file is created about once a second, some with only one or two records.
I want fewer, larger files to be written to HDFS.
I have the following settings in ConsumeKafka:
Message Demarcator = <new line>
Max Poll Records = 10000
Max Uncommitted Time = 20s
In the past I've used Flume instead of NiFi, and it has batchSize and batchDurationMillis, which allow me to tweak how big the HDFS files are. It seems like ConsumeKafka in NiFi is missing a batchDurationMillis equivalent.
What's the solution in NiFi?
Using the Message Demarcator and Max Poll Records is the correct approach to get multiple messages per flow file. You may also want to slow down the ConsumeKafka processor by adjusting the Run Schedule (on the Scheduling tab) from 0 sec, which means run as fast as possible, to something like 1 second or whatever makes sense for you, so that it grabs more data per execution.
Even with the above, you would likely still want to put a MergeContent processor before PutHDFS and merge flow files together based on size, so that you can wait until you have the appropriate amount of data before writing to HDFS.
How to use MergeContent will depend on the type of data you are merging... If you have Avro, there is a specific merge strategy for Avro. If you have JSON you can merge them one after another, or you can wrap them with a header, footer, and demarcator to make a valid JSON array.
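As a rough example only (the size and age values below are placeholders to tune for your own volume and latency needs), a line-oriented text or JSON feed might configure MergeContent along these lines:
Merge Strategy = Bin-Packing Algorithm
Merge Format = Binary Concatenation
Minimum Group Size = 128 MB
Maximum Group Size = 256 MB
Max Bin Age = 5 min
Delimiter Strategy = Text
Demarcator = <new line>
A bin is merged once its minimums are met or its Max Bin Age expires, so these values trade HDFS file size against how long data waits before being written.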
We have a use case where we receive data in flat files, which we load into an Oracle DB using Spring Batch. After the data load into Oracle, we have to distribute the data in the form of flat files to several consumers. The data selection criteria depend on pre-decided values in certain fields of the data.
We have a design in place which generates a list of objects that can be passed to a Spring Batch job as job parameters to generate the flat files that need to be sent to the data consumers.
Using a Splitter component, I can put the individual objects into a channel and plug a JobLaunchingGateway to launch a batch job to generate the flat file.
I need help on how to launch multiple batch jobs in parallel using the JobLaunchingGateway so that I can generate the files in parallel.
A setup is already in place to FTP the files to consumers. We do not need to worry about that.
Use an ExecutorChannel with a task executor before the JobLaunchingGateway.
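A minimal Java-config sketch of that idea (the channel names are placeholders, and it assumes your splitter is already sending JobLaunchRequest messages to splitterOutputChannel):

import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.integration.launch.JobLaunchingGateway;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.integration.channel.ExecutorChannel;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Configuration
public class ParallelJobLaunchConfig {

    // Channel backed by a task executor: each JobLaunchRequest message is handed
    // off to its own thread, so the downstream job launches run in parallel.
    @Bean
    public ExecutorChannel jobRequestChannel() {
        return new ExecutorChannel(new SimpleAsyncTaskExecutor("job-launch-"));
    }

    @Bean
    public IntegrationFlow fileGenerationFlow(JobLauncher jobLauncher) {
        return IntegrationFlows.from("splitterOutputChannel")  // output of your splitter
                .channel(jobRequestChannel())                   // hand off to executor threads
                .handle(new JobLaunchingGateway(jobLauncher))   // launch one job per request
                .channel("jobExecutionResultChannel")           // JobExecution replies land here
                .get();
    }
}

Because each request is handled on its own thread from the ExecutorChannel, the gateway launches the file-generation jobs concurrently rather than one after another.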
Sorry, I'm new to Apache NiFi. I made a data flow that pulls data from Hive and stores it in SQL. There is no error in my data flow; the only problem is that it pulls the data repeatedly.
My data flow consists of the following:
SelectHiveQL
SplitAvro
ConvertAvroToJSON
ConvertJSONToSQL
PutSQL
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so that it pulls only 20 rows, or exactly the number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to only run once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt that current execution, it just causes it not to be scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.)