Increase parallel copy in Azure Data Factory - azure-blob-storage

I'm trying to run an ADF copy activity from Cosmos DB to an Azure Blob container. The degree of copy parallelism is always 1, even if I specify a greater value. What determines the parallelism value, and how do I increase it? When the copy is from Cosmos DB to another Cosmos DB, I can get more than 1 parallel copy, but when the sink is a blob container I hit this issue.
Currently, the pipeline writes a single file to the Azure Blob container. I don't see an option to write multiple output files, in case that would help increase the parallel copy.

As per the official Microsoft documentation:
The degree of copy parallelism in the copy activity is the maximum number of threads that the activity uses to copy data in parallel from the source to the sink in order to increase throughput.
It won’t generate multiple files in the sink.
It only increases the throughput of the activity by copying concurrently with the number of threads you specify.
The copy activity writes the source table or data to a single file regardless of the degree of copy parallelism; the setting affects only throughput, not the number of output files.
If you want to write multiple files from a single Cosmos DB container, you can use a ForEach activity that iterates over a set of inputs derived from the Cosmos DB dataset, with a copy activity inside it that copies the data for each input to its own file on each iteration. You can set how many iterations run concurrently with the Batch count setting of the ForEach activity.
See the azurelib article by Deepak Goyal to learn more about parallel execution of the copy activity using ForEach.
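The ForEach approach is configured in the ADF pipeline itself, but its behaviour can be modelled in a few lines of Scala: one slice per input value, one output file per slice, and at most `batchCount` copies in flight. This is only a sketch under stated assumptions; the `Slice` type, the `c.region` filter property, and the `export/part-NNNN.json` file naming are hypothetical and not part of ADF.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ForEachCopySketch {
  // Hypothetical description of one slice of the source container.
  final case class Slice(index: Int, query: String, sinkFile: String)

  // One slice per input value; each slice is written to its own output file.
  def planSlices(keys: Seq[String]): Seq[Slice] =
    keys.zipWithIndex.map { case (key, i) =>
      Slice(i, s"SELECT * FROM c WHERE c.region = '$key'", f"export/part-$i%04d.json")
    }

  // Stand-in for the copy activity: it only reports what it would copy.
  def copySlice(slice: Slice): String = s"${slice.query} -> ${slice.sinkFile}"

  // Run all copies with at most `batchCount` running at the same time,
  // which is the role Batch count plays for a non-sequential ForEach.
  def runAll(slices: Seq[Slice], batchCount: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(batchCount)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try Await.result(Future.traverse(slices)(s => Future(copySlice(s))), 1.minute)
    finally pool.shutdown()
  }
}
```

Calling `ForEachCopySketch.runAll(ForEachCopySketch.planSlices(Seq("eu", "us", "apac")), batchCount = 2)` would process three slices, two at a time, and produce three separate output files instead of one.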

exception: org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update

I have a simple Spark job that streams data to a Delta table.
The table is pretty small and is not partitioned.
A lot of small parquet files are created.
As recommended in the documentation (https://docs.delta.io/1.0.0/best-practices.html) I added a compaction job that runs once a day.
    val path = "..."
    val numFiles = 16
    spark.read
      .format("delta")
      .load(path)
      .repartition(numFiles)
      .write
      .option("dataChange", "false")
      .format("delta")
      .mode("overwrite")
      .save(path)
Every time the compaction job runs the streaming job gets the following exception:
org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.
I tried to add the following config parameters to the streaming job:
    spark.databricks.delta.retryWriteConflict.enabled = true # would be false by default
    spark.databricks.delta.retryWriteConflict.limit = 3 # optionally limit the maximum number of retries
It doesn't help.
Any idea how to solve the problem?
When you're streaming the data in, small files are being created (additive) and these files are being referenced in your delta log (an update). When you perform your compaction, you're trying to resolve the small files overhead by collating the data into larger files (currently 16). These large files are created alongside the small, but the change occurs when the delta log is written to. That is, transactions 0-100 make 100 small files, compaction occurs, and your new transaction tells you to now refer to the 16 large files instead. The problem is, you've already had transactions 101-110 occur from the streaming job while the compaction was occurring. After all, you're compacting ALL of your data and you essentially have a merge conflict.
The solution is to go to the next step in the best practices and compact only selected partitions, which assumes the table is partitioned (for example by date), using:
    .option("replaceWhere", partition)
When you compact every day, the partition variable should represent the partition of your data for yesterday. No new files are being written to that partition, and the delta log can identify that the concurrent changes will not apply to currently incoming data for today.
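A minimal sketch of such a daily, partition-scoped compaction, following the pattern on the Delta best-practices page cited in the question. It assumes the table has been partitioned by a `date` column (the table in the question is not partitioned, so this is an assumption), and the names `CompactYesterday` and `partitionPredicate` are hypothetical.

```scala
import java.time.LocalDate
import org.apache.spark.sql.SparkSession

object CompactYesterday {
  // Predicate selecting yesterday's partition; `date` is an assumed partition column.
  def partitionPredicate(today: LocalDate): String =
    s"date = '${today.minusDays(1)}'"

  // Rewrite only yesterday's partition into `numFiles` larger files.
  def compact(spark: SparkSession, path: String, today: LocalDate, numFiles: Int = 16): Unit = {
    val partition = partitionPredicate(today)
    spark.read
      .format("delta")
      .load(path)
      .where(partition)                  // read only the partition being compacted
      .repartition(numFiles)
      .write
      .option("dataChange", "false")     // the data is rearranged, not changed
      .format("delta")
      .mode("overwrite")
      .option("replaceWhere", partition) // overwrite only that partition
      .save(path)
  }
}
```

Because the streaming job appends to today's partition while the compaction rewrites yesterday's, the two transactions touch disjoint sets of files.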

How do we know when a flow is completed when multiple flowfiles are running in parallel?

I have a requirement where a template uses SQL as both source and destination, and each table holds more than 100 GB of data. The template is instantiated multiple times, once per table to be migrated, and each table is partitioned into multiple flowfiles. How do we know when the process is completed? Because there are multiple flowfiles, we cannot conclude that the process is done just because one of them reaches the end processor.
I have tried using SiteToSiteStatusReportingTask to check the queue count, but it reports counts per connection, and it's difficult to fetch the connection ID for each connection and then combine them, as we have a large number of templates. The reporting task has another problem: it reports on all process groups on the NiFi canvas, which is a huge amount of data if all templates are running and may hurt performance, even though I used an Avro schema to fetch only the queue count and connection ID.
Can you please suggest some ideas and help me to achieve this?
You have multiple solutions:
1 - Use the Wait/Notify pair of processors.
If you don't want multiple flowfiles running in parallel:
2 - Set backpressure on the queue.
3 - Specify process-group-level FlowFile concurrency (recommended, but available only from NiFi 1.12).
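As I understand the Wait/Notify pattern, each finished flowfile increments a counter stored under a signal identifier, and the waiting side is released once that counter reaches a target count. The sketch below models only that idea in plain Scala; it is not NiFi API, and `CompletionTracker`, `notifyDone`, and `isComplete` are hypothetical names. Using the table name as the signal identifier is also an assumption.

```scala
import scala.collection.mutable

// Conceptual model of the Wait/Notify counting idea, not NiFi code.
final class CompletionTracker {
  // signal id (for example a table name) -> number of partitions finished so far
  private val counts = mutable.Map.empty[String, Int].withDefaultValue(0)

  // What the notifying side does: bump the counter for this signal id.
  def notifyDone(signalId: String): Unit = synchronized {
    counts(signalId) = counts(signalId) + 1
  }

  // What the waiting side checks: has the expected number of partitions arrived?
  def isComplete(signalId: String, targetCount: Int): Boolean = synchronized {
    counts(signalId) >= targetCount
  }
}
```

With one tracker entry per table, a table counts as migrated once every one of its partition flowfiles has signalled, independently of what the other templates are doing.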

Creating larger NiFi flow files when using the ConsumeKafka processor

I've created a simple NiFi pipeline that reads a stream of data from a Kafka topic (using ConsumeKafka) and writes it to the HDFS (using PutHDFS). Currently, I'm seeing lots of small files being created on the HDFS. A new file is created about once a second, some with only one or two records.
I want fewer, larger files to be written to the HDFS.
I have the following settings in ConsumeKafka:
Message Demarcator = <new line>
Max Poll Records = 10000
Max Uncommitted Time = 20s
In the past I've used Flume instead of NiFi, and it has batchSize and batchDurationMillis, which allow me to tweak how big HDFS files are. It seems like ConsumeKafka in NiFi is missing a batchDurationMillis equivalent.
What's the solution in NiFi?
Using the Message Demarcator and Max Poll Records is the correct approach to get multiple messages per flow file. You may want to slow down the ConsumeKafka processor by adjusting the Run Schedule (on the scheduling tab) from 0 sec which means run as fast as possible, to something like 1 second or whatever makes sense for you to grab more data.
Even with the above, you would likely still want to put a MergeContent processor before PutHDFS and merge flow files together based on size, so that you wait until you have the appropriate amount of data before writing to HDFS.
How to use MergeContent will depend on the type of data you are merging... If you have Avro, there is a specific merge strategy for Avro. If you have JSON you can merge them one after another, or you can wrap them with a header, footer, and demarcator to make a valid JSON array.
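To make the two ideas concrete (merging on size, and wrapping JSON records with a header, footer, and demarcator), here is a small model in plain Scala. It is only an illustration of the behaviour, not NiFi code; `MergeSketch`, `binBySize`, and `mergeAsJsonArray` are hypothetical names.

```scala
object MergeSketch {
  // Wrap records with header "[", demarcator "," and footer "]" to get a valid JSON array.
  def mergeAsJsonArray(records: Seq[String]): String =
    records.mkString("[", ",", "]")

  // Group records into bins, closing a bin once it holds at least `minBytes` of content.
  def binBySize(records: Seq[String], minBytes: Int): Seq[Seq[String]] = {
    val bins = Vector.newBuilder[Seq[String]]
    var current = Vector.empty[String]
    var size = 0
    for (record <- records) {
      current :+= record
      size += record.getBytes("UTF-8").length
      if (size >= minBytes) {
        bins += current
        current = Vector.empty
        size = 0
      }
    }
    // Leftover records form a final, smaller bin; a real flow would keep waiting or time out.
    if (current.nonEmpty) bins += current
    bins.result()
  }
}
```

Each bin corresponds to one merged flow file, so raising the minimum size directly reduces the number of files written to HDFS.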

Spark EMR S3: Processing a Large Number of Files

I have around 15,000 ORC files in S3; each file contains a few minutes' worth of data, and file sizes vary between 300 and 700 MB.
Since recursively looping through a directory laid out in YYYY/MM/DD/HH24/MIN format is expensive, I create a file that lists all S3 files for a given day (objects_list.txt) and pass this list as input to the Spark read API:
    import scala.collection.mutable

    // Read the list of S3 object paths bundled with the job as a classpath resource.
    val file_list = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/objects_list.txt"))
    val paths: mutable.Set[String] = mutable.Set[String]()
    for (line <- file_list.getLines()) {
      // Keep only non-empty lines that name part files.
      if (line.length > 0 && line.contains("part"))
        paths.add(line.trim)
    }
    val eventsDF = spark.read.format("orc").option("spark.sql.orc.filterPushdown", "true").load(paths.toSeq: _*)
    eventsDF.createOrReplaceTempView("events")
The cluster has 10 r3.4xlarge worker nodes (each with 120 GB RAM and 16 cores), and the master is an m3.2xlarge.
The problem I am facing is that the Spark read runs endlessly. Only the driver is working and the other nodes aren't doing anything. I am not sure why the driver is opening each S3 file for reading: AFAIK Spark works lazily, so reading shouldn't happen until an action is called. I think it is listing each file and collecting some metadata associated with it.
But why is only the driver working while the other nodes do nothing, and how can I make this operation run in parallel on all worker nodes?
I have come across these articles, https://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 and https://gist.github.com/snowindy/d438cb5256f9331f5eec, but there the entire file contents are read as an RDD. In my use case, only the blocks/columns being referenced should be fetched from S3 (columnar access, given that ORC is my storage format). The files in S3 have around 130 columns, but only 20 fields are referenced and processed using the DataFrame APIs.
Sample Log Messages:
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=09/min=00/part-r-00199-e4ba7eee-fb98-4d4f-aecc-3f5685ff64a8.zlib.orc' for reading
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=19/min=00/part-r-00023-5e53e661-82ec-4ff1-8f4c-8e9419b2aadc.zlib.orc' for reading
Only one executor is running, and that is the driver program on one of the task nodes (cluster mode). CPU is at 0% on the other nodes (the workers), and even after 3-4 hours of processing the situation is the same, given the huge number of files that have to be processed.
Any pointers on how I can avoid this issue, i.e. speed up the load and processing?
There is a solution based on AWS Glue that can help you.
You have a lot of files in S3, but they are partitioned by timestamp. With Glue you can use your S3 objects like Hive tables in your EMR cluster.
First, create an EMR cluster with release 5.8 or later; the cluster creation settings then offer AWS Glue Data Catalog options.
Check both options. This allows the cluster to access the AWS Glue Data Catalog.
After this, add your root folder to the AWS Glue Data Catalog. The fastest way to do that is with a Glue crawler, which crawls your data and creates the catalog entries you need.
After the crawler runs, the metadata for your table is in the catalog, where you can see it from AWS Athena.
In Athena you can check whether the crawler identified your data properly.
This solution makes Spark behave much as it would on a real HDFS, because the metadata is held properly in the Data Catalog. The time your app currently spends discovering and indexing the files is saved, so the jobs run faster.
Working this way, I was able to improve my queries, and working with partitions was much better with Glue. So give it a try; it will probably help performance.
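Once the table is in the catalog, the job can refer to it by name and filter on partition columns instead of passing thousands of paths. The sketch below assumes a cluster already configured to use the Glue Data Catalog as its metastore; the table name `logs.events` and the selected columns `event_id` and `event_time` are placeholders, and treating the partition keys `y`, `m`, `d` (taken from the paths in the log messages) as strings is an assumption about what the crawler produces.

```scala
import java.time.LocalDate
import org.apache.spark.sql.SparkSession

object QueryCatalogTable {
  // Partition filter for one day, matching the y=/m=/d= layout seen in the S3 paths.
  def dayFilter(day: LocalDate): String =
    f"y = '${day.getYear}%d' AND m = '${day.getMonthValue}%02d' AND d = '${day.getDayOfMonth}%02d'"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("events-from-catalog")
      .enableHiveSupport() // table metadata comes from the metastore configured on the cluster
      .getOrCreate()

    // `logs.events`, `event_id` and `event_time` are hypothetical names.
    val events = spark.table("logs.events")
      .where(dayFilter(LocalDate.of(2017, 9, 20)))
      .select("event_id", "event_time")

    events.createOrReplaceTempView("events")
  }
}
```

Filtering on partition columns lets Spark restrict the read to the matching partitions recorded in the catalog, and selecting only the needed columns keeps the ORC read columnar.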

Map Reduce - How to plan the data files

I would like to use AWS EMR to query large log files that I will write to S3. I can design the files any way I like. The data is created at a rate of 10K entries/minute.
The logs consist of dozens of data points, and I'd like to collect data over a very long period of time (years) to compare trends etc.
What are the best practices for creating such files that will be stored on S3 and queried by an AWS EMR cluster?
What's the optimal file size? Should I create separate files, for example on an hourly basis?
What is the best way to name the files?
Should I place them in daily/hourly buckets or all in the same bucket?
What's the best way to handle things like adding some data after a while, or a change in the data structure that I use?
Should I compress things, for example by leaving domain names out of URLs, or keep as much data as possible?
Is there any concept like partitioning? (The data comes from hundreds of websites, so I could use site IDs, for example.) I must be able to query all the data together, or by partition.
Thanks!
In my opinion, you should store the data in S3 in hourly buckets and then use a pipeline to schedule your MR job to clean the data.
Once you have cleaned the data, you can keep it at a location in S3 and then run a data pipeline hourly, with a lag of 1 hour relative to your MR pipeline, to load the processed data into Redshift.
Hence, at 3 am on a given day you will have 3 hours of processed data in S3 and 2 hours loaded into the Redshift DB.
To do this, you can dedicate one machine to running pipelines, and on that machine define a shell, Perl, Python, or similar script to load the data into your DB.
You can use the AWS bucketing formatter for year, month, date, hour, and so on. For example:
{format(minusHours(#scheduledStartTime,2),'YYYY')}/mm=#{format(minusHours(#scheduledStartTime,2),'MM')}/dd=#{format(minusHours(#scheduledStartTime,2),'dd')}/hh=#{format(minusHours(#scheduledStartTime,2),'HH')}/*
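The same lagged, hour-partitioned prefix can be computed outside the pipeline expression language, for instance in a loader script. This is a minimal sketch with `java.time`; `HourlyPrefix` and `prefixFor` are hypothetical names, and the `yyyy=` key for the year component is an assumption, since the start of the expression above is cut off.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object HourlyPrefix {
  // Quoted parts are literal text; yyyy, MM, dd and HH are the date fields.
  private val fmt = DateTimeFormatter.ofPattern("'yyyy='yyyy'/mm='MM'/dd='dd'/hh='HH")

  // Prefix for the hour that lies `lagHours` before the scheduled start time.
  def prefixFor(scheduledStart: LocalDateTime, lagHours: Long): String =
    scheduledStart.minusHours(lagHours).format(fmt)
}
```

Subtracting the lag before formatting means a run scheduled shortly after midnight correctly points at the previous day's folder.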
