Spring Batch - How to do chunk-based processing from the cache

As per my use case, I am extending my question further based on this earlier question: Spring Batch With Annotation and Caching.
I am looking to use Spring Batch as a streaming platform. System-A gives me data in a tab-delimited file, and that data contains certain query parameters (field1, field2 and field3).
Note: for some rows in the flat file, field1, field2 and field3 could be null.
Case-1: If field1 is present, call System-B using its REST API and get the data. If no data exists, call using field2; if there is still no data, try with field3; if there is still no data, mark it as a data quality issue.
Case-2: If field1 is null, use field2 to call System-B and get the data; if no data exists, call using field3.
Case-3: If field1 and field2 are null, try with field3. Those are all the combinations.
Once we have the data from System-B for each row, we call System-C based on the System-B data, and with the data from System-C we call System-B again.
I really want to make it stream-based: hold all the data in a cache until enrichment is completely done, then do chunk-based processing and save System-A data into Table-A, System-B data into Table-B, System-C data into Table-C and so on, and then have an association between them in the master table, etc.
Is there any way we can do chunk-based processing from the cache?
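A chunk-oriented step only needs an ItemReader, and that reader can hand out items from whatever cache the enrichment phase filled. Below is a minimal sketch, assuming Spring Batch 5's StepBuilder API, a hypothetical EnrichedRecord type, and a plain in-memory queue standing in for the cache (a real cache such as Redis or Ehcache would sit behind the reader in the same way):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

// Hypothetical type for a fully enriched row (System-A + System-B + System-C data).
record EnrichedRecord(String systemAData, String systemBData, String systemCData) {}

// Reader that drains the cache one item at a time; the chunk-oriented step
// groups whatever it returns into chunks of the configured size.
class CacheItemReader implements ItemReader<EnrichedRecord> {

    private final Queue<EnrichedRecord> cache;

    CacheItemReader(Queue<EnrichedRecord> cache) {
        this.cache = cache;
    }

    @Override
    public EnrichedRecord read() {
        // Returning null signals "no more data" and ends the step.
        return cache.poll();
    }
}

@Configuration
class CacheChunkConfig {

    // The enrichment phase fills this queue before the chunk step runs.
    private final Queue<EnrichedRecord> cache = new ConcurrentLinkedQueue<>();

    @Bean
    Step persistStep(JobRepository jobRepository,
                     PlatformTransactionManager txManager,
                     ItemWriter<EnrichedRecord> tableWriter) {
        return new StepBuilder("persistStep", jobRepository)
                .<EnrichedRecord, EnrichedRecord>chunk(100, txManager)
                .reader(new CacheItemReader(cache))
                .writer(tableWriter) // e.g. a CompositeItemWriter targeting Table-A/B/C
                .build();
    }
}

The writer here is just a placeholder; splitting the enriched record across Table-A, Table-B, Table-C and the master table would be done inside it (for example with a CompositeItemWriter), one chunk per transaction.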

Related

Adding dynamic records in parquet format

I'm working on building a data lake and am stuck on a very trivial thing. I'll be using Hadoop/HDFS as our data lake infrastructure and storing records in Parquet format. The data will come from a Kafka queue, which sends one JSON record at a time. The keys in the JSON record can vary from message to message. For example, in the first message the keys could be 'a', 'b', and in the second message the keys could be 'c', 'd'.
I was using pyarrow to store files in Parquet format, but as per my understanding we have to predefine the schema. So when I try to write the second message, it throws an error saying that keys 'c' and 'd' are not defined in the schema.
Could someone guide me on how to proceed with this? Any library other than pyarrow works too, as long as it offers this functionality.
Parquet supports Map types for instances where the fields are unknown ahead of time. Or, if some of the fields are known, define more concrete types for those, possibly making them nullable; however, you cannot mix named fields with a map at the same level of the record structure.
I've not used pyarrow, but I'd suggest using Spark Structured Streaming and defining a schema there, especially when consuming from Kafka. Spark's default output writer to HDFS uses Parquet.
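For illustration, here is a minimal Java sketch of that approach. The broker address, topic name and output paths are made up, and it assumes the full set of fields you might ever see ('a' through 'd' here) is known up front and can be declared as nullable columns:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

public class KafkaToParquet {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-to-parquet")
                .getOrCreate();

        // Declare every field you expect as nullable; fields absent from a
        // given message simply come out as null.
        StructType schema = new StructType()
                .add("a", DataTypes.StringType, true)
                .add("b", DataTypes.StringType, true)
                .add("c", DataTypes.StringType, true)
                .add("d", DataTypes.StringType, true);

        Dataset<Row> json = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092") // assumption: your broker
                .option("subscribe", "records")                   // assumption: your topic
                .load()
                .selectExpr("CAST(value AS STRING) AS value")
                .select(from_json(col("value"), schema).as("r"))
                .select("r.*");

        StreamingQuery query = json.writeStream()
                .format("parquet")                                        // Parquet file sink
                .option("path", "hdfs:///data/lake/records")              // assumption: target dir
                .option("checkpointLocation", "hdfs:///data/lake/_chk")   // assumption: checkpoint dir
                .start();

        query.awaitTermination();
    }
}

Because the schema is declared once and missing keys become nulls, this sidesteps the "keys not defined in schema" error you get when writing message by message.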

How to use PutSQL in Apache NiFi

I am a beginner in data warehousing and Apache NiFi. I am trying to pull data from a MySQL table into NiFi and then put that data into another MySQL database table. I am successfully getting data from the first table, and I am also able to write that data to a file using the PutFile processor.
But now I want to store that queued data in a MySQL database table. I know there is a PutSQL processor, but it is not working for me.
Can anyone let me know how to do it correctly?
Here are the screenshots of my flow
PutSQL configuration-
I converted the data from Avro to JSON and then from JSON to SQL to see if that would work, but that did not work either.
Use PutDatabaseRecord and remove the Convert* processors.
From the NiFi docs:
The PutDatabaseRecord processor uses a specified RecordReader to input (possibly multiple) records from an incoming flow file. These records are translated to SQL statements and executed as a single transaction. If any errors occur, the flow file is routed to failure or retry, and if the records are transmitted successfully, the incoming flow file is routed to success. The type of statement executed by the processor is specified via the Statement Type property, which accepts some hard-coded values such as INSERT, UPDATE, and DELETE, as well as 'Use statement.type Attribute', which causes the processor to get the statement type from a flow file attribute. IMPORTANT: If the Statement Type is UPDATE, then the incoming records must not alter the value(s) of the primary keys (or user-specified Update Keys). If such records are encountered, the UPDATE statement issued to the database may do nothing (if no existing records with the new primary key values are found), or could inadvertently corrupt the existing data (by changing records for which the new values of the primary keys exist).
This should be more performant and cleaner.

How to read and perform batch processing using Spring Batch annotation config

I have 2 different files with different data. Each file contains 10K records per day.
Ex:
Productname price date
T shirt,500,051221
Pant,1000,051221
Productname price date
T shirt,800,061221
Pant,1800,061221
I want to create a final output file by checking the price difference between today's and yesterday's files.
Ex:
Productname price
T shirt,300
Pant,800
I have to do this using Spring Batch.
I have tried a batch configuration with two different steps, but it is only able to read the data and unable to do the processing, because I need the data from both files for the processing, whereas in my case it reads one step after another.
Could anyone help me with this, ideally with some sample code?
I would suggest saving the flat file data into the database for both yesterday's and today's dates (maybe two separate tables, or the same table if you can tell the two sets of records apart easily). Read this stored data using a JdbcCursorItemReader or JdbcPagingItemReader, perform the calculation/logic/massaging of data at the processor level, and create a new flat file or save into the DB as convenient. Out of the box, Spring Batch does not provide a facility to read from two sources and perform a calculation across them.
Alternative suggestion: read the data from both flat files, keep it in a cache, then read from the cache and do the further processing, along the lines of the sketch below.
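A minimal sketch of that idea, assuming hypothetical Product and PriceDiff types and that an earlier step has already loaded yesterday's prices into an in-memory map keyed by product name; the ItemProcessor then only needs today's record to emit the difference:

import java.util.Map;

import org.springframework.batch.item.ItemProcessor;

// Hypothetical domain types for illustration only.
record Product(String name, int price, String date) {}
record PriceDiff(String name, int difference) {}

public class PriceDiffProcessor implements ItemProcessor<Product, PriceDiff> {

    // Yesterday's prices keyed by product name, populated by a previous step
    // (for example a Tasklet that reads yesterday's flat file into this map).
    private final Map<String, Integer> yesterdaysPrices;

    public PriceDiffProcessor(Map<String, Integer> yesterdaysPrices) {
        this.yesterdaysPrices = yesterdaysPrices;
    }

    @Override
    public PriceDiff process(Product today) {
        Integer yesterday = yesterdaysPrices.get(today.name());
        if (yesterday == null) {
            return null; // filtering: skip products with no counterpart in yesterday's file
        }
        return new PriceDiff(today.name(), today.price() - yesterday);
    }
}

Today's file would then be read with a regular FlatFileItemReader, and the writer would produce the final output file (e.g. "T shirt,300").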

Glue Crawler excluding many files from table after running on S3 json GZIP data

I have a Lambda that ingests JSON data from a load balancer and then writes each individual JSON record with a PUT to a Kinesis stream. The Kinesis stream is the producer for Kinesis Firehose, which deposits GZIP files into an S3 bucket under the prefix 'raw'. Example JSON record:
{"level":"INFO","hash":"3c351293-11e3-4e32-baa2-
bf810ed44466","source":"FE","hat_name":"2249444f-c3f4-4e3d-8572-
c38c3dab4848","event_type":"MELT_DOWN","payload":{"checking": "true"}}
I created an X-Ray trace in the producing Lambda so I have an idea of how many PUT requests were made (i.e., how many individual JSON records). In the time period I had this ingestion turned "on", I sent about 18,000 records to the Kinesis stream. I then ran the crawler on the table with the prefix "raw" (I used default settings but checked, in the "Crawler's output" section, "Update all new and existing partitions with metadata from the table" to avoid HIVE_PARTITION_SCHEMA_MISMATCH). The crawler runs and successfully detects the schema, which looks like this:
column        data type
level         string
hash          string
source        string
hat_name      string
event_type    string
payload       string   (the only nested JSON field; it can have many possible internal structures)
partition_0   string
partition_1   string
partition_2   string
partition_3   string
Once the table is created, I notice that there are only about 4,000 records, when it should have about 4 times that many. Later I reran the crawler and noticed that one line in the logs says:
INFO : Some files do not match the schema detected. Remove or exclude the following files from the crawler
I examined some of the excluded files; the majority of them had valid JSON data, however one or two files had a truncated JSON record at the end of the file, like so:
{"level":"INFO","hash":"3c351293-11e3-4e32-baa2-
bf810ed44466","source":"FE","hat_name":"2249444f-c3f4-4e3d-8572-
c38c3dab4848","event_type":"MELT_DOWN","payload":{"checking":
"true"}}{"level":"INFO","hash":"3c351293-11e3-4e32-baa2-
bf810ed44466","source":"FE","hat_name":"2249444f-c3f4-4e3d-8572-
c38c3dab4848","event_type":"MELT_DOWN","payl
What do I need to do in Glue to have all the records loaded into the table? I should have around 18,000, not 4,200. I think one issue is that the schema may not match exactly on some records, but I validate in the Kinesis producer that each record is a valid JSON structure with the appropriate top-level fields. The second issue I see is the file with the truncated JSON record; I am assuming this may be an issue with Firehose batching the files? Any help is appreciated.
Note: I have tried manually creating the JSON table, defining all the top-level fields, and I still have the same problem; it only finds around 4,200 entries when I query in Athena.

Best approach to determine Oracle INSERT or UPDATE using NiFi

I have a JSON flow file and I need to determine whether I should be doing an INSERT or an UPDATE. The trick is to only update the columns that match the JSON attributes. I have an ExecuteSQL working and it returns executesql.row.count; however, I lose the original JSON flow file, which I was planning to use with RouteOnAttribute. I'm trying to get MergeContent to join the ExecuteSQL output (dumping the Avro output, since I only need the executesql.row.count attribute) with the JSON flow. I've set the following before I do the ExecuteSQL:
fragment.count=2
fragment.identifier=${UUID()}
fragment.index=${nextInt()}
Alternatively I could create a MERGE, if there is a way to loop through the list of JSON attributes that match the Oracle table?
How large is your JSON? If it's small, you might consider using ExtractText (matching the whole document) to get the JSON into an attribute. Then you can run ExecuteSQL, then ReplaceText to put the JSON back into the content (overwriting the Avro results). If your JSON is large, you could set up a DistributedMapCacheServer and (in a separate flow) run ExecuteSQL and store the value of executesql.row.count into the cache. Then in the JSON flow you can use FetchDistributedMapCache with the "Put Cache Value In Attribute" property set.
If you only need the JSON to use RouteOnAttribute, perhaps you could use EvaluateJsonPath before ExecuteSQL, so your conditions are already in attributes and you can replace the flow file contents.
If you want to use MergeContent, you can set fragment.count to 2, but rather than using the UUID() function, you could set "parent.identifier" to "${uuid}" using UpdateAttribute, then DuplicateFlowFile to create 2 copies, then UpdateAttribute to set "fragment.identifier" to "${parent.identifier}" and "fragment.index" to "${nextInt():mod(2)}". This gives a mergeable set of two flow files; you can route on fragment.index being 0 or 1, sending one to ExecuteSQL and one through the other flow, joining back up at MergeContent.
Another alternative is to use ConvertJSONToSQL set to "UPDATE", and if it fails, route those flow files to another ConvertJSONToSQL processor set to "INSERT".
