I have a log file named log.json.
A simple insert into RethinkDB works perfectly.
Now this JSON file gets updated every second. How can I make sure RethinkDB gets the new data automatically? Is there a way to achieve this, or do I simply have to use the API and insert into the database as well as write to the log file myself?
Thanks.
The process that appends new entries to your JSON file should probably also run a query to insert the same entries into RethinkDB.
Or you can have a cron job (sketched below) that:
gets the last entry saved in RethinkDB
reads your JSON file for new entries
inserts the new entries
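A minimal sketch of that cron-job approach, assuming log.json holds one JSON object per line with a "timestamp" field and that a table named "logs" already exists (both assumptions); it uses the official RethinkDB Java driver (2.4.x API) and Jackson:

```java
// Minimal sketch of the cron-job approach (assumptions: log.json holds one JSON
// object per line with a "timestamp" field, and a table named "logs" exists).
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;

public class LogSync {
    private static final RethinkDB r = RethinkDB.r;

    public static void main(String[] args) throws Exception {
        Connection conn = r.connection().hostname("localhost").port(28015).connect();

        // 1. Get the timestamp of the last entry already saved in RethinkDB.
        String lastSeen = "";
        try {
            Object last = r.table("logs").max("timestamp").getField("timestamp")
                           .run(conn).single();
            if (last != null) {
                lastSeen = last.toString();
            }
        } catch (Exception e) {
            // Table is empty or has no timestamps yet: insert everything.
        }

        // 2. Read log.json and insert only entries newer than the stored one.
        ObjectMapper mapper = new ObjectMapper();
        List<String> lines = Files.readAllLines(Paths.get("log.json"));
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            Map<String, Object> entry =
                    mapper.readValue(line, new TypeReference<Map<String, Object>>() {});
            String ts = String.valueOf(entry.get("timestamp"));
            if (ts.compareTo(lastSeen) > 0) {
                r.table("logs").insert(entry).run(conn);
            }
        }

        conn.close();
    }
}
```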
I have a use case to create an incremental data ingestion pipeline from a database to AWS S3. I have created a pipeline and it is working fine except for the one scenario where no incremental data is found.
In the case of a zero record count, it writes a header-only (parquet) file. I want to skip the target write when there are no incremental records.
How can I implement this in IICS?
I have already tried a Router transformation with the condition that the target is written only if the record count is > 0, but it still does not work.
First of all: the target file gets created even before any data is read from the source. This is to ensure the process has write access to the target location. So even if there is no data to store, an empty file will get created.
The possible ways out here are to:
Have a command task check the number of lines in the output file and delete it if there is just a header (a sketch of such a check follows the list). This would require the file to be created locally, verified, and uploaded to S3 afterwards, e.g. using a Mass Ingestion task - all invoked sequentially via a taskflow.
Have a session that will first check if there is any data available, and only then run the data extraction.
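A minimal sketch of the kind of check such a command task could invoke, assuming the intermediate output is written locally as a delimited text file with one header line (for an actual parquet file you would inspect the row count with a parquet library instead of counting lines); the file path is passed in as an argument:

```java
// Hypothetical helper a command task could run: delete the output file if it
// contains only a header line, so nothing gets uploaded to S3.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DeleteIfHeaderOnly {
    public static void main(String[] args) throws IOException {
        Path target = Paths.get(args[0]); // path to the locally created output file

        long lineCount;
        try (Stream<String> lines = Files.lines(target)) {
            lineCount = lines.count();
        }

        if (lineCount <= 1) {
            // Header only (or empty): remove the file so the upload step has nothing to send.
            Files.deleteIfExists(target);
            System.out.println("Removed header-only file: " + target);
        } else {
            System.out.println("Keeping file with " + (lineCount - 1) + " data rows: " + target);
        }
    }
}
```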
I am new to using Apache NiFi, and I am trying to create a template that takes a JSON file and turns it into a set of SQL INSERT statements.
So far I have created a template that takes the JSON file and I have got it to the point of PutSQL. There is no database to connect to at the moment, but what I have not been able to check is the output. Can this be done? What I need to check is whether the JSON array has been turned into one INSERT per element in the array.
As far as inspecting the output, what does your flow look like? If you have something like ConvertJSONToSQL -> PutSQL, you can leave PutSQL stopped and run ConvertJSONToSQL; then you will see FlowFile(s) in the connection between the two processors. Then you can right-click on the connection, choose List Queue, and click the "eye" icon on the right for the FlowFile you wish to inspect. That will show you the contents of the FlowFile right before it goes into PutSQL.
Having said all that, if your JSON file contains fields that correspond to columns in your database, consider PutDatabaseRecord instead of ConvertJSONToSQL -> PutSQL. That can use a JsonTreeReader to parse each record, and it will generate and execute the necessary SQL as a prepared statement using the values in all records of the FlowFile. That way you don't need to generate the SQL yourself or worry about fragmented transactions or any of that.
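For intuition only, here is a rough hand-written sketch of the work PutDatabaseRecord automates: parse each JSON record and bind its values into a parameterized INSERT. The table name "events", the columns, the JDBC URL, and the input file name are all assumptions for the example, not anything NiFi generates:

```java
// Hand-written illustration (not NiFi code) of what PutDatabaseRecord automates:
// one parameterized INSERT per element of the JSON array, executed as a batch.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JsonToInsert {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode records = mapper.readTree(Files.readAllBytes(Paths.get("input.json"))); // a JSON array

        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/test", "user", "pass");
             PreparedStatement ps = conn.prepareStatement("INSERT INTO events (id, name) VALUES (?, ?)")) {
            for (JsonNode record : records) {
                ps.setLong(1, record.get("id").asLong());
                ps.setString(2, record.get("name").asText());
                ps.addBatch(); // one INSERT per array element
            }
            ps.executeBatch();
        }
    }
}
```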
NiFi version 1.5
I have a CSV file that arrives the first time like:
datetime,a.DLG,b.DLG,c.DLG
2019/02/04 00:00,86667,98.5,0
2019/02/04 01:00,86567,96.5,0
I used ListFile -> FetchFile to get the CSV file.
After the next 10 minutes, I get the appended CSV file:
datetime,a.DLG,b.DLG,c.DLG
2019/02/04 00:00,86667,98.5,0
2019/02/04 01:00,86567,96.5,0
2019/02/04 02:00,86787,99.5,0
2019/02/04 03:00,86117,91.5,0
Here, how do I get only the new records (the last two)? I do not want to process the first two records that have already been processed.
My thought process is that we need to get the maximum datetime, store it in an attribute, and use QueryRecord, but I do not know which processor to use to get the maximum datetime.
Is there any better solution?
This is currently an open issue (NIFI-6047) but there has been a community contribution to address it, so you may see the DetectDuplicateRecord processor in an upcoming release of NiFi.
There may be a workaround to split up the CSV rows and create a compound key using ExtractText, then using DetectDuplicate.
This doesn't seem to be a job that is best solved in NiFi, as you need to keep state about what you have already processed. An alternative would be to delete what you have already processed; then you can assume that whatever is in the file has not been processed yet.
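For illustration, outside NiFi the state-keeping idea would look roughly like this: a sketch that remembers the last processed datetime in a small watermark file and emits only the newer CSV rows (file names and the datetime format are assumptions based on the sample above):

```java
// Sketch of keeping state outside NiFi: persist the last processed datetime in a
// watermark file and process only CSV rows that are newer than it.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class IncrementalCsv {
    public static void main(String[] args) throws IOException {
        Path csv = Paths.get("data.csv");
        Path watermarkFile = Paths.get("last_processed.txt");

        // Last datetime already handled; empty string means "nothing processed yet".
        String watermark = Files.exists(watermarkFile)
                ? Files.readString(watermarkFile).trim()
                : "";

        List<String> lines = Files.readAllLines(csv);
        String maxSeen = watermark;
        for (String line : lines.subList(1, lines.size())) {  // skip the header row
            String datetime = line.split(",", 2)[0];           // e.g. "2019/02/04 02:00"
            if (datetime.compareTo(watermark) > 0) {           // lexicographic order works for this format
                System.out.println(line);                      // "process" the new record here
                if (datetime.compareTo(maxSeen) > 0) {
                    maxSeen = datetime;
                }
            }
        }

        // Persist the new watermark for the next run.
        Files.writeString(watermarkFile, maxSeen);
    }
}
```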
How do I get only the new records (the last two)? I do not want to process the first two records that have already been processed.
From my understanding, the actual question is "how to process/ingest CSV rows as they are written to the file?".
Description of 'TailFile' processor from NiFi documentation:
"Tails" a file, or a list of files, ingesting data from the file as it
is written to the file. The file is expected to be textual. Data is
ingested only when a new line is encountered (carriage return or
new-line character or combination)
This solution is appropriate when you don't want to move or delete the actual file.
I have been experimenting with NiFi (via Hortonworks HDF) and can't grasp something that seems basic. I start with a file... I want to filter records based on a value (e.g., field number 2 contains "xyz")... then write only the matching records to HDFS.
I have a non-filtered flow working but can't find any docs or examples showing how to apply a schema to (or in any way understand the "content" of) the file.
What am I missing?
(I have seen references to FlowFiles being "schema-less" - not sure how that is useful).
For example, a file like this:
[app1][field1][field2]
[app2][field1][field2]
[app2][field1][field2][field3]
[app3][field1]
I want to filter to select records containing "[app2]" and create an HDFS file that looks like this:
field1,field2
field1,field2,field3
I have a requirement to write multiple files using Spring Batch. The first file will be written based on the data from a database table. The second file will contain just the number of records written to the first file. How can I create the second file? I am not sure whether org.springframework.batch.item.file.MultiResourceItemWriter is an option for me, as I think it writes multiple files by splitting chunks of the data across them. Correct me if I am wrong here.
Please do suggest some options with sample code if possible.
You have a couple of options:
You can use CompositeItemWriter, which calls a collection of item writers in a defined order, so you can define one item writer that writes records based on the data from the DB and a second that counts the records and writes the count to another file.
You can write the data to a file in the first step and finish the whole file; if the record count is all you need, save it to the StepContext (see common batch patterns and scroll to 11.8 Passing Data to Future Steps), then read the counter in a new Tasklet and save it to a new file.
If you want to go with option 1, which I think is the right choice, you can check this example of a batch job configuration with CompositeItemWriter; a minimal sketch of the idea follows.
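A minimal sketch of option 1, assuming the Spring Batch 4.x ItemWriter API and a simple String item type; the file names records.txt and count.txt are just illustrative:

```java
// Sketch of option 1: a CompositeItemWriter delegating to the record writer and
// to a small counting writer that rewrites the count file after every chunk.
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.core.io.FileSystemResource;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class WritersConfig {

    // First delegate: writes the actual records read from the database table.
    public FlatFileItemWriter<String> recordWriter() {
        FlatFileItemWriter<String> writer = new FlatFileItemWriter<>();
        writer.setResource(new FileSystemResource("records.txt"));
        writer.setLineAggregator(new PassThroughLineAggregator<>());
        return writer;
    }

    // Second delegate: counts items and rewrites the count file after every chunk,
    // so it always holds the latest total.
    public static class CountingItemWriter implements ItemWriter<String> {
        private final AtomicLong count = new AtomicLong();

        @Override
        public void write(List<? extends String> items) throws Exception {
            count.addAndGet(items.size());
            Files.write(Paths.get("count.txt"), String.valueOf(count.get()).getBytes());
        }
    }

    // Composite writer that calls both delegates, in order, for every chunk.
    public CompositeItemWriter<String> compositeWriter() {
        CompositeItemWriter<String> composite = new CompositeItemWriter<>();
        List<ItemWriter<? super String>> delegates =
                Arrays.asList(recordWriter(), new CountingItemWriter());
        composite.setDelegates(delegates);
        return composite;
    }
}
```

Wire compositeWriter() in as the step's item writer; if you only need the count once at the very end, writing it from a StepExecutionListener instead of after every chunk is another option.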