i have one table with a lot of data:
id | title | server 1 | server 2 | server 3
--------------------------------------------
1 | item1 | 110.0.0.1| 110.0.0.2| 110.0.0.3
2 | item2 | 110.0.0.4| 110.0.0.2| 110.0.0.5
..
n | itemn | 110.0.0.1| 110.0.0.2| 110.0.0.3
I want to process all this data using spring boot and save the result in database, so for that, what is the easiest, simplest and the best why to do that ?
It seems that the map reduce of apache can do this job but it is so big and complicated to set up.
The actual use case:
one spring boot instance
select * from item;
process item by item.
The expected use case:
n spring boot instance
select * from item limit n
process item by item
consolidation of result and save in database
Have a look at Spring Batch. It allows chunking (processing the data in multiple chunks in multiple threads) and should fit your use case very well.
https://docs.spring.io/spring-batch/trunk/reference/html/spring-batch-intro.html
What I may suggest is to have one data processing pipeline setup using Spring Batch rather than n spring boot instances.
Spring batch will have each of the steps as below:
Extract Data using Hive (select * from item) - Make sure to write them as file output to an external location.
The extracted data is the input to a MapReduce framework, where each item is processed and desired output is written.
The output of the mapreduce is consolidated in this batch step.
Another process (again distributed, if possible) to save into database.
Related
Here is my scenario:
a1. read records from table A
a2. process these records one by one and generate a new temp table B for each record
b1. read records from table B, process these records data and save it in a file
a3. tag the record from table A as finished status
A pseudo code to describe this scenario:
foreach item in items:
1. select large amount data where id=item.id then save the result to temp table_id
2. process all records in table_id then write then to a file
3. update item status
4. send message to client
This is my design:
create a Spring Batch job, set a date as its parameter
create a step1 to read records from table A
create a step2 to read records from temporary table B and start it in the processor of step1
I check the Spring Batch docs, I didn't find any related introduction about how to nest a step into a step's processor. seems the Step is the minimum unit in Spring Batch and it cannot be split.
Update
Here is the pseudo code about what I did now to solve the problem:
(I'm using spring boot 2.7.8)
def Job:
PagingItemReader(id) :
select date from temp_id
FlatFileItemWriter:
application implement commandlinerunner:
items = TableAReposiroy.SelectAllBetweenDate
for item : items:
Service.createTempTableBWithId(item.id)
Service.loadDataToTempTable(item.id)
job = createJob(item.id)
luancher.run(job)
update item status
A step is part of a job. It is not possible to start a step from within an item processor.
You need to break your requirement into different steps, without trying to "nest" steps into each others. In your case, you can do something like:
create a Spring Batch job, set a date as its parameter
create a step1 to read records from table A and generate a new temp table B
create a step2 to read records from temporary table B, process these records data and save it in a file and tag the record from table A as finished status
The writer of step2 would be a composite writer: it writes data to the file AND updates the status of processed records in table A. This way, if there is an issue writing the file, the status of records in table A would be rolled back (in fact, they are not processed correctly as the write operation failed, in which case they need to be reprocessed).
You should have single job with two steps, stepA and stepB, spring batch does provide provision for controlled flow of execution of steps, you can sequentially execute two steps. For each item Once stepA reaches its writer and writes data, stepB will start. You can configure stepB to read data written by stepA.
You can also pass data between steps using Job Execution context, Once stepA ends put data in Job Execution context, it can be accessed in stepB once it starts. This can help in your case because you can pass item identifier which stepA picked for processing and pass it to stepB so that stepB can have this identifier in its writer to update its final status.
I have a requirement where we have a template which uses SQL as source and SQL as destination and data would be more than 100GB for each table so here template will be instantiated multiple times based on tables to be migrated and also each table is partitioned into multiple flowfiles. How do we know when the process is completed? As here there will be multiple flowfiles we are unable to conclude as it hits end processor.
I have tried using SitetoSiteStatusReportingTask to check queue count, but it provides count based on connection and its difficult to fetch connectionid for each connection then concatenate as we have large number of templates. Here we have another problem in reporting task as it provides data on all process groups which are available on NIFI canvas which will be huge data if all templates are running and may impact in performance even though I used avro schema to fetch only queue count and connection id.
Can you please suggest some ideas and help me to achieve this?
you have multiple solution :
1 - you can use the wait/notify duo processor.
if you dont want multiple flowfile running parallely :
2 - set backpressure on Queue
3 - specify group level flow file concurrency (recommended but Nifi 1.12 only )
I am bit puzzled here, I need to do a task similar to the following scenario with Spring Batch
Read Person from repository ==> I can use RepositoryItemReader
(a) Generate CSV file (FlatFileItemWriter) and (b) save CSV file in DB with the generated date (I can use RepositoryItemWriter)
But here I am struggling to understand how I can give generated CSV file output of 2a to save in DB 2b.
Consider CSV File has approx 1000+ Person Data which are processed for a single day.
is it possible to merge 2a & 2b? I thought about CompositeItemWriter but as here we are combining 1000+ employee in CSV file so it won't work.
Using a CompositeItemWriter won't work as you will be trying to write an incomplete file to the database for each chunk..
I would not merge 2a and 2b. Make each step do one thing (and do it well):
Step 1 (chunk-oriented tasklet): read persons and generate the file
Step 2 (simple tasklet) : save the file in the database
Step 1 can use the job execution context to pass the name of the generated file to step 2. You can find an example in the Passing Data To Future Steps section. Moreover, with this setup, step 2 will not run if step 1 fails (which makes sense to me).
I am about doing some signal analysis with Hadoop/Spark and I need help on how to structure the whole process.
Signals are now stored in a database, that we will read with Sqoop and will be transformed in files on HDFS, with a schema similar to:
<Measure ID> <Source ID> <Measure timestamp> <Signal values>
where signal values are just string made of floating point comma separated numbers.
000123 S001 2015/04/22T10:00:00.000Z 0.0,1.0,200.0,30.0 ... 100.0
000124 S001 2015/04/22T10:05:23.245Z 0.0,4.0,250.0,35.0 ... 10.0
...
000126 S003 2015/04/22T16:00:00.034Z 0.0,0.0,200.0,00.0 ... 600.0
We would like to write interactive/batch queries to:
apply aggregation functions over signal values
SELECT *
FROM SIGNALS
WHERE MAX(VALUES) > 1000.0
To select signals that had a peak over 1000.0.
apply aggregation over aggregation
SELECT SOURCEID, MAX(VALUES)
FROM SIGNALS
GROUP BY SOURCEID
HAVING MAX(MAX(VALUES)) > 1500.0
To select sources having at least a single signal that exceeded 1500.0.
apply user defined functions over samples
SELECT *
FROM SIGNALS
WHERE MAX(LOW_BAND_FILTER("5.0 KHz", VALUES)) > 100.0)
to select signals that after being filtered at 5.0 KHz have at least a value over 100.0.
We need some help in order to:
find the correct file format to write the signals data on HDFS. I thought to Apache Parquet. How would you structure the data?
understand the proper approach to data analysis: is better to create different datasets (e.g. processing data with Spark and persisting results on HDFS) or trying to do everything at query time from the original dataset?
is Hive a good tool to make queries such the ones I wrote? We are running on Cloudera Enterprise Hadoop, so we can also use Impala.
In case we produce different derivated dataset from the original one, how we can keep track of the lineage of data, i.e. know how the data was generated from the original version?
Thank you very much!
1) Parquet as columnar format is good for OLAP. Spark support of Parquet is mature enough for production use. I suggest to parse string representing signal values into following data structure (simplified):
case class Data(id: Long, signals: Array[Double])
val df = sqlContext.createDataFrame(Seq(Data(1L, Array(1.0, 1.0, 2.0)), Data(2L, Array(3.0, 5.0)), Data(2L, Array(1.5, 7.0, 8.0))))
Keeping array of double allows to define and use UDFs like this:
def maxV(arr: mutable.WrappedArray[Double]) = arr.max
sqlContext.udf.register("maxVal", maxV _)
df.registerTempTable("table")
sqlContext.sql("select * from table where maxVal(signals) > 2.1").show()
+---+---------------+
| id| signals|
+---+---------------+
| 2| [3.0, 5.0]|
| 2|[1.5, 7.0, 8.0]|
+---+---------------+
sqlContext.sql("select id, max(maxVal(signals)) as maxSignal from table group by id having maxSignal > 1.5").show()
+---+---------+
| id|maxSignal|
+---+---------+
| 1| 2.0|
| 2| 8.0|
+---+---------+
Or, if you want some type-safety, using Scala DSL:
import org.apache.spark.sql.functions._
val maxVal = udf(maxV _)
df.select("*").where(maxVal($"signals") > 2.1).show()
df.select($"id", maxVal($"signals") as "maxSignal").groupBy($"id").agg(max($"maxSignal")).where(max($"maxSignal") > 2.1).show()
+---+--------------+
| id|max(maxSignal)|
+---+--------------+
| 2| 8.0|
+---+--------------+
2) It depends: if size of your data allows to do all processing in query time with reasonable latency - go for it. You can start with this approach, and build optimized structures for slow/popular queries later
3) Hive is slow, it's outdated by Impala and Spark SQL. Choice is not easy sometimes, we use rule of thumb: Impala is good for queries without joins if all your data stored in HDFS/Hive, Spark has bigger latency but joins are reliable, it supports more data sources and has rich non-SQL processing capabilities (like MLlib and GraphX)
4) Keep it simple: store you raw data (master dataset) de-duplicated and partitioned (we use time-based partitions). If new data arrives into partition and your already have downstream datasets generated - restart your pipeline for this partition.
Hope this helps
First, I believe Vitaliy's approach is very good in every aspect. (and I'm all for Spark)
I'd like to propose another approach, though. The reasons are:
We want to do Interactive queries (+ we have CDH)
Data is already structured
The need is to 'analyze' and not quite 'processing' of the data. Spark could be an overkill if (a) data being structured, we can form sql queries faster and (b) we don't want to write a program every time we want to run a query
Here are the steps I'd like to go with:
Ingestion using sqoop to HDFS: [optionally] use --as-parquetfile
Create an External Impala table or an Internal table as you wish. If you have not transferred the file as parquet file, you can do that during this step. Partition by, preferably Source ID, since our groupings are going to happen on that column.
So, basically, once we've got the data transferred, all we need to do is to create an Impala table, preferably in parquet format and partitioned by the column that we're going to use for grouping. Remember to do compute statistics after loading to help Impala run it faster.
Moving data:
- if we need to generate feed out of the results, create a separate file
- if another system is going to update the existing data, then move the data to a different location while creating->loading the table
- if it's only about queries and analysis and getting reports (i.e, external tables suffice), we don't need to move the data unnecessarily
- we can create an external hive table on top of the same data. If we need to run long-running batch queries, we can use Hive. It's a no-no for interactive queries, though. If we create any derived tables out of queries and want to use through Impala, remember to run 'invalidate metadata' before running impala queries on the hive-generated tables
Lineage - I have not gone deeper into it, here's a link on Impala lineage using Cloudera Navigator
I receive an input file which has 200 MM of records. The records are just a keys.
For each record from this file (which i'll call SAMPLE_FILE), i need to retrieve all records from a database (which i'll call EVENT_DATABASE ) that match key . The EVENT_DATABASE can have billions of records.
For example:
SAMPLE_FILE
1234
2345
3456
EVENT_DATABASE
2345 - content C - 1
1234 - content A - 3
1234 - content B - 5
4567 - content D - 7
1234 - content K - 7
1234 - content J - 2
So the system will iterate through each record from SAMPLE_RECORD and get all EVENTS which has the same key. For example, getting 1234 and query the EVENT_DATABASE will retrieve:
1234 - content A - 3
1234 - content B - 5
1234 - content K - 7
1234 - content J - 2
Then i will execute some calculations using the result set. For example, count, sum, mean
F1 = 4 (count)
F2 = 17 (sum(3+5+7+2))
I will approach the problem storing the EVENT_DATABASE using HBASE. Then i will run a map-reduce job, and in the map phase i will query the HBase, get he events and execute the calculations.
The process can be in batch. It is not necessary to be real time.
Does anyone suggests another architecture? Do i really need a map reduce job? Can i use another approach?
I personally solved this kind of problem using MapReduce, HDFS & HBase for batch analysis. Your approach seems to be good for implementing your use-case I am guessing you are going to store the calculations back into HBase.
Storm could also be used to implement the same usecase, but Storm really shines with streaming data & near real time processing rather than data at rest.
You don't really need to query Hbase for every single event. According to me this would be a better approach.
Create an external table in hive using your input file.
Create an external table in hive for your hbase table using Hive Hbase Integration (https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration)
Do a join on both the tables and get the retrieve the results.
Your approach would have been good if you were only querying for a subset of your input file but since you are querying hbase for all recrods (20M), using a join would be more efficient.