I have data for 4 million users that is looked up by phone number. The source data is in S3. The search response time must be 50 milliseconds and the system needs to be 99.995% available.
I am thinking about a nightly job like the one below:
S3 Source Data -> Glue ETL -> CSV file or RDS -> Cache Upload job -> AWS Redis global datastore
I am leaning towards RDS because, if the upload job fails for some reason, I don't want to restart from the beginning. Every row uploaded to Redis will be marked as processed.
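A minimal sketch of that resumable upload, assuming a PostgreSQL RDS staging table with a processed flag, psycopg2 and the redis-py client (the table, column and host names here are illustrative, not part of the original design):

import psycopg2
import redis

# Hypothetical staging table: users(phone_number, payload, processed)
BATCH_SIZE = 5000

conn = psycopg2.connect(host="my-rds-host", dbname="staging",
                        user="loader", password="...")
r = redis.Redis(host="my-redis-primary", port=6379)

with conn, conn.cursor() as cur:
    while True:
        # Fetch the next batch of rows that have not been pushed to Redis yet.
        cur.execute(
            "SELECT phone_number, payload FROM users "
            "WHERE processed = FALSE LIMIT %s", (BATCH_SIZE,))
        rows = cur.fetchall()
        if not rows:
            break

        # Pipeline the writes so one round trip covers the whole batch.
        pipe = r.pipeline()
        for phone_number, payload in rows:
            pipe.set(f"user:{phone_number}", payload)
        pipe.execute()

        # Mark the batch as processed so a failed run can resume from here.
        cur.execute(
            "UPDATE users SET processed = TRUE WHERE phone_number = ANY(%s)",
            ([row[0] for row in rows],))
        conn.commit()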
I was planning to do an initial load of data from S3 to Redis and then do incremental loads thereafter. However, the data team has informed me that they can't produce incremental data; in other words, they can't tell me what changed yesterday.
As a result, every day I will have to empty the Redis cache and reload the entire dataset. Since this upload will take considerable time, I am wondering how to keep the response time at 50 ms while the data is loading.
Will appreciate any ideas. Thanks!
Context:
I am an information architect (not a data engineer, was once a Unix and Oracle developer), so my technical knowledge in Azure is limited to browsing Microsoft documentation.
The context of this problem is ingesting data from a constantly growing CSV file in Azure ADLS into an Azure SQL MI database.
I am designing an Azure data platform that includes a SQL data warehouse with the first source system being a Dynamics 365 application.
The data warehouse is following Data Vault 2.0 patterns. This is well suited to the transaction log nature of the CSV files.
This platform is in early development - not in production.
The CSV files are created and updated (append mode) by an Azure Synapse Link that exports Dataverse write operations on selected Dataverse entities to our ADLS storage account. The service is configured in append mode, so every Dataverse write operation (create, update and delete) appends a row to the entity's corresponding CSV file. Each CSV file is essentially a transaction log of the corresponding Dataverse entity.
Synapse Link operates in an event-based fashion: creating a record in Dataverse triggers a CSV append. Latency is typically a few seconds. There aren't any SLAs (promises), and latency can be several minutes if the API caps are breached.
The CSV is partitioned annually, meaning a new CSV file is created at the start of each year and continues to grow throughout that year.
We are currently trialling ADF as the means of extracting records from the CSV for loading into the data warehouse. We are not wedded to ADF and can consider changing horses.
Request:
I'm searching for an event-based ingestion solution that monitors a source CSV file for new records (appended to the end of the file), extracts only those new records, and then processes each record in sequence, resulting in one or more SQL insert operations per new CSV record. If I were back in my old Unix days, I would build a process around the "tail -f" command as the start of the pipeline, with the next step being an ETL process that handled each record served by the tail command. But I can't figure out how to do this in Azure.
This process will be the pattern for many more similar ingestion processes; there could be approximately one thousand CSV files that need to be processed in this event-based, near-real-time fashion. I assume one process per CSV file.
Some nonfunctional requirements are speed and efficiency.
My goal is an event-based solution (low latency = speed) that doesn't need to read the entire file every 5 minutes to see if there are changes. That kind of (micro) batch process would be horribly inefficient (read: expensive, with roughly 15,000x redundant processing). This is where the desire for something like Unix "tail -f" comes in: it watches the file for changes, emitting new data as it is appended to the source file. I'd hate to do something like a 'diff' every 5 minutes, as this is inefficient and, when scaled to thousands of tables, will be prohibitively expensive.
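For what it's worth, the "tail -f" idea can be approximated against a growing blob by remembering the last byte offset that was processed and downloading only the range beyond it. A minimal sketch, assuming the azure-storage-blob SDK; the connection string, container, blob name and the place where the offset is persisted are all placeholders:

from azure.storage.blob import BlobClient

# Placeholders: supply your own connection string, container and blob name.
blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="synapse-export",
    blob_name="account/2023.csv",
)

def read_new_bytes(last_offset: int) -> tuple[bytes, int]:
    """Return any bytes appended since last_offset, plus the new offset."""
    size = blob.get_blob_properties().size
    if size <= last_offset:
        return b"", last_offset
    # Download only the appended range, not the whole file.
    chunk = blob.download_blob(offset=last_offset, length=size - last_offset).readall()
    return chunk, size

# The offset would be persisted somewhere durable (a table, a file, ...) between runs.
new_data, last_offset = read_new_bytes(last_offset=0)
for line in new_data.decode("utf-8").splitlines():
    pass  # parse the CSV row and emit the corresponding SQL inserts here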
One possible solution to your problem is to store each new CSV record as a separate blob.
You will then be able to use Azure Event Grid to raise events when a new blob is created in Blob Storage, i.e. use Azure Blob Storage as an Event Grid source.
The basic idea is to store the changed CSV data as a new blob and have Event Grid wired to the Blob Created event. An Azure Function can listen to these events and then process only the new data. For auditing purposes, you can save this data in a separate Append Blob once the CSV processing has been completed.
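A rough sketch of such a Function, assuming the Python v1 programming model with an Event Grid trigger (the trigger binding itself lives in function.json; the credential and the processing helper are placeholders):

import azure.functions as func
from azure.storage.blob import BlobClient

def main(event: func.EventGridEvent) -> None:
    # The Blob Created event carries the URL of the newly written blob.
    data = event.get_json()
    blob_url = data["url"]

    # Read the new record(s) from the blob that triggered the event.
    blob = BlobClient.from_blob_url(blob_url, credential="<storage-key-or-sas>")
    content = blob.download_blob().readall().decode("utf-8")

    for row in content.splitlines():
        process_row(row)

def process_row(row: str) -> None:
    # Placeholder: parse the CSV row and issue the SQL insert(s),
    # e.g. via pyodbc against the Azure SQL MI database.
    ...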
To store data in an S3 bucket from Databricks, I used to write the following:
df.write.format("delta").mode("overwrite").save("s3://.....")
The code above used to take 3.27 minutes; it now takes 7.37 minutes, using the same cluster configuration and the same data.
I have around 15,000 ORC files in S3, where each file contains a few minutes' worth of data and each file's size varies between 300 and 700 MB.
Since recursively looping through a directory laid out in YYYY/MM/DD/HH24/MIN format is expensive, I am creating a file that contains the list of all S3 files for a given day (objects_list.txt) and passing this file as input to the Spark read API:
import scala.collection.mutable

// Read the day's object listing bundled as a classpath resource.
val file_list = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/objects_list.txt"))

// Keep only the actual part files from the listing.
val paths: mutable.Set[String] = mutable.Set[String]()
for (line <- file_list.getLines()) {
  if (line.length > 0 && line.contains("part"))
    paths.add(line.trim)
}

// Load all part files as one ORC DataFrame with filter pushdown enabled.
val eventsDF = spark.read.format("orc").option("spark.sql.orc.filterPushdown", "true").load(paths.toSeq: _*)
eventsDF.createOrReplaceTempView("events")
The cluster consists of 10 r3.4xlarge worker machines (each node: 120 GB RAM and 16 cores) and an m3.2xlarge master.
The problem I am facing is that the Spark read runs endlessly; I see only the driver working while all the other nodes do nothing. I am not sure why the driver is opening each S3 file for reading, because AFAIK Spark works lazily, so reading shouldn't happen until an action is called. I think it is listing each file and collecting some metadata associated with it.
But why is only the driver working while the rest of the nodes do nothing, and how can I make this operation run in parallel on all worker nodes?
I have come across these articles, https://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 and https://gist.github.com/snowindy/d438cb5256f9331f5eec, but there the entire file contents are read as an RDD. In my use case, only the blocks/columns that are actually referenced should be fetched from S3 (columnar access, given that ORC is my storage format). The files in S3 have around 130 columns, but only 20 fields are referenced and processed using the DataFrame APIs.
Sample Log Messages:
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=09/min=00/part-r-00199-e4ba7eee-fb98-4d4f-aecc-3f5685ff64a8.zlib.orc' for reading
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=19/min=00/part-r-00023-5e53e661-82ec-4ff1-8f4c-8e9419b2aadc.zlib.orc' for reading
Only one executor is running, namely the driver program on one of the task nodes (cluster mode), and CPU is at 0% on all of the other nodes (i.e. the workers). Even after 3-4 hours of processing the situation is the same, given the huge number of files that have to be processed.
Any pointers on how I can avoid this issue, i.e. speed up the load and processing?
There is a solution that can help you, based on AWS Glue.
You have a lot of files partitioned in S3, with partitions based on timestamp. Using Glue, you can treat your objects in S3 like "Hive tables" in your EMR cluster.
First you need to create an EMR cluster with version 5.8+, which exposes options to use the AWS Glue Data Catalog for table metadata. Check both options (Hive and Spark); this allows the cluster to access the AWS Glue Data Catalog.
After this you need to add your root folder to the AWS Glue Data Catalog. The fastest way to do that is with a Glue Crawler. This tool will crawl your data and create the catalog entries you need.
I suggest you take a look at the Glue Crawler documentation.
After the crawler runs, the metadata for your table will be in the catalog, and you can see it in AWS Athena.
In Athena you can check if your data was properly identified by the crawler.
This solution will make your Spark job work much as it would against a real HDFS, because the metadata will be in the Data Catalog, and the time your app currently spends on file listing ("indexing") goes away, letting the jobs run faster.
Working with this setup I was able to improve my queries, and working with partitions was much better with Glue. So give it a try; it can probably help with performance.
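As a rough illustration (assuming the crawler registered a database called events_db with a table events partitioned on y/m/d/h/min; these names are placeholders), you could then query directly from Spark on EMR and let the catalog handle partition pruning:

from pyspark.sql import SparkSession

# enableHiveSupport makes Spark use the Glue Data Catalog as its metastore
# (given the EMR cluster was created with the Glue Catalog options checked).
spark = SparkSession.builder.appName("events-query").enableHiveSupport().getOrCreate()

# Only the partitions matching the predicate are listed and read from S3.
df = spark.sql("""
    SELECT col_a, col_b
    FROM events_db.events
    WHERE y = '2017' AND m = '09' AND d = '20'
""")
df.show()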
My project is to implement an interactive query for users to explore the data. For example, we have a list of columns the user can choose from; the user adds columns to a list and presses "view data". The data is currently stored in Cassandra and we use Spark SQL to query it.
The data flow is: a raw log is processed by Spark and stored into Cassandra. The data is time series with more than 20 columns and 4 metrics. In my tests so far, because there are more than 20 dimensions in the clustering keys, writes to Cassandra are quite slow.
The idea here is to load all the data from Cassandra into Spark and cache it in memory, then provide an API to the client and run queries against the Spark cache.
But I don't know how to keep that cached data persistent. I am trying to use spark-jobserver, which has a feature called shared objects, but I am not sure it works.
We can provide a cluster with more than 40 CPU cores and 100 GB of RAM. We estimate the data to query is about 100 GB.
What I have already tried:
Storing the data in Alluxio and loading it into Spark from there, but the load time is slow: when loading 4 GB of data, Spark has to do two things, first reading from Alluxio (which takes more than 1 minute) and then spilling to disk (Spark shuffle), which costs another 2 or 3 minutes. That exceeds our target of under 1 minute. We tested 1 job on 8 CPU cores.
Storing the data in MemSQL, but it's kind of costly: one day of data costs 2 GB of RAM, and I'm not sure the speed will hold up when we scale.
Using Cassandra directly, but Cassandra does not support GROUP BY.
So, what I really want to know is whether my direction is right or not. What can I change to achieve the goal (queries like MySQL with a lot of GROUP BY, SUM, ORDER BY) and return results to the client via an API?
If you explicitly call cache or persist on a DataFrame, it will be saved in memory (and/or disk, depending on the storage level you choose) until the context is shut down. This is also valid for sqlContext.cacheTable.
So, as you are using Spark JobServer, you can create a long running context (using REST or at server start-up) and use it for multiple queries on the same dataset, because it will be cached until the context or the JobServer service shuts down. However, using this approach, you should make sure you have a good amount of memory available for this context, otherwise Spark will save a large portion of the data on disk, and this would have some impact on performance.
Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but this is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query your table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as these jobs are executed on the same context.
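A minimal sketch of that pattern in PySpark, using the same sqlContext API referenced above (the keyspace, table and column names are illustrative, and the DataStax spark-cassandra-connector is assumed to be on the classpath; in a JobServer setup the two parts would run as separate jobs against the same long-running context):

# Job 1: load the data once, register it, and cache it in the shared context.
df = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(table="events", keyspace="analytics").load()
df.registerTempTable("events")
sqlContext.cacheTable("events")   # materialized on first use, then kept in memory

# Job 2 (later, same context): queries hit the cached table, not Cassandra.
result = sqlContext.sql(
    "SELECT site_id, SUM(metric_1) AS total FROM events GROUP BY site_id ORDER BY total DESC")
result.show()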
I would like to use AWS EMR to query large log files that I will write to S3. I can design the files any way I like. The data is created at a rate of 10K entries/minute.
The logs consist of dozens of data points and I'd like to collect data for a very long period of time (years) to compare trends etc.
What are the best practices for creating such files that will be stored on S3 and queried by AWS EMR cluster?
What's the optimal file size? Should I create separate files, for example on an hourly basis?
What is the best way to name the files?
Should I place them in daily/hourly buckets or all in the same bucket?
What's the best way to handle things like adding some data after a while, or a change in the data structure that I use?
Should I compress things, for example by leaving domain names out of URLs, or keep as much data as possible?
Is there any concept like partitioning (the data comes from hundreds of websites, so I could use site IDs, for example)? I must be able to query all the data together, or by partition.
Thanks!
In my opinion you should use hourly buckets (prefixes) to store the data in S3 and then use a pipeline to schedule your MR job to clean the data.
Once you have cleaned the data, you can keep it in a location in S3, and then you can run a Data Pipeline on an hourly basis, lagging one hour behind your MR pipeline, to load this processed data into Redshift.
Hence, at 3 am on a given day you will have 3 hours of processed data in S3 and 2 hours loaded into the Redshift DB.
To do this you can have one machine dedicated to running pipelines, and on that machine you can define your shell/Perl/Python script to load the data into your DB.
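As a rough sketch of such a load script (assuming Python with psycopg2; the connection details, table, S3 prefix and IAM role are placeholders), the hourly job could simply issue a Redshift COPY for the previous hour's processed prefix:

import psycopg2

# Placeholders: connection details, table, S3 prefix and IAM role are illustrative.
conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="loader", password="...")

copy_sql = """
    COPY analytics.page_events
    FROM 's3://my-processed-bucket/yyyy=2017/mm=10/dd=08/hh=02/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV GZIP;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift loads the whole hourly prefix in parallel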
You can use the AWS Data Pipeline format expressions for year, month, date, hour and so on, e.g.:
yyyy=#{format(minusHours(@scheduledStartTime,2),'YYYY')}/mm=#{format(minusHours(@scheduledStartTime,2),'MM')}/dd=#{format(minusHours(@scheduledStartTime,2),'dd')}/hh=#{format(minusHours(@scheduledStartTime,2),'HH')}/*