Excessive memory usage with a Dask DataFrame created from a Parquet file

I have a Parquet file that is 800K rows × 8.7K columns. I loaded it into a Dask DataFrame:
import dask.dataframe as dd
dask_train_df = dd.read_parquet('train.parquet')
dask_train_df.info()
This yields:
<class 'dask.dataframe.core.DataFrame'>
Columns: 8712 entries, 0 to 8711
dtypes: int8(8712)
When I try to do simple operations like dask_train_df.head() or dask_train_df.loc[2:4].compute(), I get memory errors, even with 17+ GB of RAM.
However, if I do:
import pandas as pd
train = pd.read_parquet('../input/train.parquet')
train.info()
yields:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Columns: 8712 entries, 0 to 8711
dtypes: int8(8712)
memory usage: 6.5 GB
and I can run train.head() and train.loc[2:4] with no problems, since everything is in memory already.
1) So my question is: why do these simple operations blow up the memory usage with a Dask DataFrame, but work fine when I load everything into memory with a Pandas DataFrame?
I notice that npartitions=1, and I see in the documentation that read_parquet "reads a directory of Parquet data into a Dask.dataframe, one file per partition". In my case, it sounds like I'm losing out on all of the parallelization power of having multiple partitions, but then shouldn't the Dask DataFrame memory usage be capped by the memory needed for the single Pandas DataFrame?
2) Also, a side question: if I wanted to parallelize this single Parquet file by partitioning it in a Dask DataFrame, how would I do so? I don't see a blocksize parameter in the dd.read_parquet signature. I also tried using the repartition function, but I believe that partitions along the rows, and in a Parquet file I would want to partition along the columns?

First, I would like to comment that 8712 columns is rather many, and you will find that parsing the schema/metadata may take significant time, never mind the data loading.
When fastparquet loads data, it first allocates a dataframe of sufficient size, then iterates through the columns/chunks (with appropriate overheads, which apparently are small in this case) and assigns values into the allocated dataframe.
When you run a calculation through Dask (any calculation), there can in many cases be intra-task copies in memory of the input variables and other intermediate objects. That is usually not an issue, as the whole data-set should be split into many parts, and the small intermediates' memory overhead is a price worth paying for being able to handle datasets larger than memory. I am not sure at which point you are getting a copy; it may be worth investigating and preventing it.
In your case, the whole data-set is a single partition. This will result in a single load task, running in one thread. You will not be getting any parallelism, and any intermediate internal copies apply to the whole dataset. You could load only part of the data by selecting columns, and so manufacture partitions and achieve parallelism that way. However, the typical way to handle parquet data is to make use of "row-group" partitions (i.e., along the index) and multiple files, so the real way to avoid the problem is to use data which is already appropriately partitioned.
Note that since you can load the data directly with fastparquet/pandas, you could probably also save a partitioned version either with the to_parquet method or fastparquet's write function.
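A minimal sketch of that last suggestion, assuming the single in-memory load is feasible (as it is here); the partition count and output path are arbitrary choices:

import pandas as pd
import dask.dataframe as dd

train = pd.read_parquet('train.parquet')                    # one full in-memory load, as in the pandas example above
ddf = dd.from_pandas(train, npartitions=16)                 # split along the row index into 16 partitions
ddf.to_parquet('train_partitioned/', engine='fastparquet')  # writes one file per partition

Reading 'train_partitioned/' back with dd.read_parquet should then give a dataframe with npartitions=16, so an operation like head() only has to materialise a single partition.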

Related

How to improve performance of csv to parquet file format using pyspark?

I have a large dataset I need to convert from csv to parquet format using pyspark. There is approximately 500GB of data scattered across thousands of csv files. My initial implementation is simplistic ...
spark = SparkSession.builder \
.master("local") \
.appName("test") \
.getOrCreate()
df = spark.read.csv(input_files, header=True, inferSchema=True)
df.repartition(1).write.mode('overwrite').parquet(output_dir)
The performance is abysmal; I have let it run for 2+ hours before giving up. From the logging output I infer that it does not even finish reading the csv files into the dataframe.
I am running spark locally on a server with 128 high performance CPU cores and 1TB of memory. Disk storage is SSD based with confirmed read speeds of 650 MB/s. My intuition is that I should be able to significantly improve performance given the computing resources available. I'm looking for tips on how to do this.
I have tried...
not inferring the schema; this did not produce a noticeable difference in performance (the schema is four columns of text)
using the configuration setting spark.executor.cores to match the number of physical cores on my server. The setting did not seem to have any effect; I did not observe the system using more cores.
I'm stuck using pyspark for now per management direction, but if necessary I can convince them to use a different tool.
Some suggestions based on my experience working with Spark:
You should not infer the schema if you are dealing with huge data. It might not show a significant improvement in performance, but it will definitely still save you some time.
Don't use repartition(1), as it shuffles the data and creates a single partition holding everything, which is exactly what you don't want with the volume of data you have. I would suggest increasing the number of partitions if possible, based on your cluster configuration, in order to get the parquet files saved faster.
Don't cache/persist your dataframe if you are just reading the csv files and then, in the next step, saving them as parquet files. It can increase your saving time, since caching itself takes time. Caching the dataframe would have helped if you were performing multiple transformations on it and then performing multiple actions; you are performing only one action of writing the dataframe as parquet files, so in my view you should not cache it. A sketch putting these suggestions together follows.
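A minimal sketch putting the suggestions above together (explicit schema, no repartition(1), no caching). The column names in the schema are placeholders, since only "four columns of text" is known, and input_files and output_dir are as in the question:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = (SparkSession.builder
         .master("local[*]")          # local[*] uses all cores; plain "local" runs on a single core
         .appName("csv_to_parquet")
         .getOrCreate())

# Explicit schema instead of inferSchema=True; the column names are placeholders
schema = StructType([StructField(f"col{i}", StringType(), True) for i in range(4)])

df = spark.read.csv(input_files, header=True, schema=schema)
df.write.mode("overwrite").parquet(output_dir)    # keep the natural partitioning; no repartition(1)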
Some possible improvements:
Don't use .repartition(1), as you lose parallelism for the write operation.
Persist/cache the dataframe before writing: df.persist()
If you really need to save it as a single parquet file, you can first write into a temp folder without reducing partitions, then use coalesce in a second write operation:
df = spark.read.csv(input_files, header=True, inferSchema=True).persist()
# ....
# First write keeps the existing partitioning, so it stays parallel
df.write.mode('overwrite').parquet("/temp/folder")
df.unpersist()

# Second pass: read the parquet back and collapse it into a single file
df1 = spark.read.parquet("/temp/folder")
df1.coalesce(1).write.mode('overwrite').parquet(output_dir)

Spark not ignoring empty partitions

I am trying to read a subset of a dataset by using a pushdown predicate.
My input dataset consists of 1.2 TB in 43,436 parquet files stored on S3. With the pushdown predicate I am supposed to read 1/4 of the data.
Looking at the Spark UI, I see that the job actually reads 1/4 of the data (300 GB), but there are still 43,436 partitions in the first stage of the job; only 1/4 of these partitions have data, and the other 3/4 are empty (see the median input size in the attached screenshots).
I was expecting Spark to create partitions only for the non-empty ones. I am seeing a 20% performance overhead when reading the whole dataset with the pushdown predicate compared to another job reading the pre-filtered dataset (1/4 of the data) directly. I suspect that this overhead is due to the huge number of empty partitions/tasks I have in my first stage, so I have two questions:
Is there any workaround to avoid these empty partitions?
Can you think of any other reason for the overhead? Maybe pushdown filter execution is naturally a little slow?
Thank you in advance
Using S3 Select, you can retrieve only a subset of data.
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR. S3 Select allows applications to retrieve only a subset of data from an object.
Otherwise, S3 acts as an object store, in which case an entire object has to be read. In your case you have to read all the content of all the files and filter it on the client side.
There is actually a very similar question, where by testing you can see that:
The input size was always the same as the Spark job that processed all of the data
You can also see this question about optimizing data read from s3 of parquet files.
Seems like your files are rather small: 1.2 TB / 43,436 ≈ 30 MB. So you may want to look at increasing spark.sql.files.maxPartitionBytes to see if it reduces the total number of partitions. I don't have much experience with S3, so I'm not sure whether it's going to help, given this note in its description:
The maximum number of bytes to pack into a single partition when
reading files. This configuration is effective only when using
file-based sources such as Parquet, JSON and ORC.
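For reference, a minimal way to try that setting from a PySpark session (the value and the S3 path below are just examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_size_check").getOrCreate()
# Pack up to ~1 GB of input into each read partition (the default is 128 MB)
spark.conf.set("spark.sql.files.maxPartitionBytes", str(1024 * 1024 * 1024))

df = spark.read.parquet("s3://bucket/dataset/")   # placeholder path
print(df.rdd.getNumPartitions())                  # how many read partitions Spark now plans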
Empty partitions: it seems that Spark (2.4.5) really tries to create partitions of size ≈ spark.sql.files.maxPartitionBytes (default 128 MB) by packing many files into one partition (source code here).
However, it does this work before running the job, so it can't know that 3/4 of the files will produce no data once the pushed-down predicate is applied. For the partitions into which it packed only files whose rows get filtered out, I ended up with empty partitions. This also explains why my max partition size is 44 MB and not 128 MB: by chance, none of the partitions consisted only of files that passed the pushdown filter.
20% overhead: finally, this is not due to the empty partitions. I managed to have far fewer empty partitions by setting spark.sql.files.maxPartitionBytes to 1 GB, but it didn't improve reading. I think the overhead is due to opening many files and reading their metadata.
Spark estimates that opening a file is equivalent to reading 4 MB (spark.sql.files.openCostInBytes). So opening many files, even ones that won't actually be read thanks to the filter, is not negligible.
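As a back-of-the-envelope check of that last point (plain Python, using the file count above and the 4 MB default):

open_cost_bytes = 4 * 1024 * 1024                 # spark.sql.files.openCostInBytes default
num_files = 43436
equivalent_gib = num_files * open_cost_bytes / 1024**3
print(f"~{equivalent_gib:.0f} GiB of equivalent read cost just from opening the files")   # ~170 GiB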

Surrogate Key Mapping for large (50 Million) keysets in Apache Flink

I have a use case where an Apache Flink job must integrate near-real-time data streams (events) from multiple sources, but due to the lack of uniform keys in the different systems I need to use a surrogate key (SK) lookup from an existing database. The SK data set is very large (50 million+ keys). Is it possible/advisable to cache such a data set for in-stream transformation (mapping) without a DB lookup? If yes, what are the caching limitations? If not, what alternatives are possible with Flink?
There are a few options
Local map
If the surrogate key mapping never changes, you could just load it in RichMapFunction#open and perform the lookup there. That of course means that you will have to adjust the memory settings so that Flink doesn't try to take all the memory for its own operations.
Some quick math: assume both keys are strings of length 10. They will each need 40 bytes of chars in memory. With some object overhead, we get to ~50 bytes per entry. With 50M entries, we need 2.5 GB of RAM to store that. Because the hash map will have some overhead, I'd plan with 3 GB of RAM.
So if your task manager has 8 GB, I'd set taskmanager.memory.size to 4 GB.
Of course, you need to ensure that different tasks of the same task manager do not load the same map twice. Also, I'd choose a format that is suited to loading the data as quickly as possible (e.g., Avro), because slow parsing will greatly increase startup and recovery time.
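RichMapFunction#open above is the Java API; a rough sketch of the same local-map idea in PyFlink, where the plain MapFunction exposes open(), might look like the following. The mapping file name, its two-column layout, and the event shape are all assumptions:

import csv
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import MapFunction

class SurrogateKeyLookup(MapFunction):

    def open(self, runtime_context):
        # Load the natural-key -> surrogate-key mapping once per task instance
        self.sk_map = {}
        with open('sk_mapping.csv') as f:   # placeholder file with two columns; a quick-to-parse format is preferable
            for natural_key, surrogate_key in csv.reader(f):
                self.sk_map[natural_key] = surrogate_key

    def map(self, event):
        natural_key, payload = event
        return (self.sk_map.get(natural_key), payload)   # None if the key is unknown

env = StreamExecutionEnvironment.get_execution_environment()
events = env.from_collection([("k1", "payload1"), ("k2", "payload2")])
events.map(SurrogateKeyLookup()).print()
env.execute("surrogate_key_lookup")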
State-based
If memory is an issue or the data is changing, you can also model the lookup data as map state. I'd add a second input for that lookup data and use a KeyedCoProcessFunction, then feed whatever comes from the second input into the map state. The state should use a RocksDB backend, so that the data effectively resides on disk.
Joining data
A lookup can also be modeled as a join. If you are already using Table API, have a look at Join with Temporal Table. This will internally use the state-based approach but is much more concise. You can also mix DataStream with Tables.

Spark performance analysis for joins

Input data
I have two tables exported from MySQL as csv files.
Table 1 size on disk : 250 MB
Records : 0.7 Million
Table 2 size on disk : 350 MB
Records : 0.6 Million
Update for code
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

val tableOne = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("example-input-files/table-1-data.csv")
tableOne.registerTempTable("table_one")

val tableTwo = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("example-input-files/table-2-data.csv")
tableTwo.registerTempTable("table_two")

sqlContext.cacheTable("table_one")
sqlContext.cacheTable("table_two")

val result = sqlContext.sql("SELECT table_one.ID, table_two.ID FROM table_one LEFT JOIN table_two ON table_one.ID = table_two.ID")
result.take(2).foreach(println)
The Spark Job
Read the two csv files using the Databricks CSV lib and register them as tables.
Perform a left join on both using a common column, a typical left join in relational-DB speak.
Print the top two results, since printing to the console itself will consume time.
This takes 30 seconds in total. I am running on a single machine with enough memory for both files to fit in (it's 600 MB after all).
There were two ways that I ran the job.
Run the job as a whole, i.e. load all the csv data, run the join and then print the results.
The second way: I first cached the tables in memory using sqlContext.cacheTable("the_table").
After caching, I found that the join operation itself took 8 seconds to complete.
Is this time reasonable? I am guessing it's not, and that there are a lot of optimisations that could be done to speed up the query.
Optimizations that I see
Putting the data into HDFS instead of on local disk. Will this speed up retrieval?
Running on a cluster; I am guessing this will not be faster, since the data fits in memory and sequential processing will be faster.
Would modelling the data and using Cassandra be faster?
I am using plain SQL to join; would an RDD join be faster?
Is there any other way to do things better ?
As mentioned by the commenters, Spark is designed for distributed computing. The overhead alone for all the initialization and scheduling is enough to make Spark seem slow compared to other PL's when working locally on small(ish) data.
Running on a cluster; I am guessing this will not be faster, since the data fits in memory and sequential processing will be faster.
The executors will actually work on their local copy of the data in memory for as long as your code performs narrow transformations, so this is not exactly correct. Your code performs a join, however, which is a wide transformation, meaning the blocks have to be shuffled across the network. Keep this in mind. Wide transformations are expensive, so as much as possible put them at the end of a DAG. But again, your data is small enough that you might not see the benefits.
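A small, self-contained illustration of that distinction in PySpark (tiny made-up tables; only the join-key name ID matches the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("narrow_vs_wide").getOrCreate()

# Tiny stand-ins for the two tables
table_one = spark.createDataFrame([(1, "a"), (2, "b")], ["ID", "v1"])
table_two = spark.createDataFrame([(1, "x"), (3, "y")], ["ID", "v2"])

narrow = table_one.select("ID")                           # narrow: each partition is processed independently
wide = table_one.join(table_two, on="ID", how="left")     # wide: rows with equal IDs must be co-located, so Spark shuffles

wide.show()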
Another thing is that if you have Hive then you could consider storing the data in a table partitioned on your join column.

How to compare two large data sets using hadoop mapreduce?

I am new to Hadoop and MapReduce. We have a normal Java application where we read a file (8 GB in size) from the Hadoop file system and apply some rules to that data. After applying the rules we get a Java HashMap (which is huge) and we keep that data in a cache or buffer. At the same time we get data from Hive by running a query on it and prepare another Java HashMap, which is again huge. Now we compare both HashMaps to prepare a final report checking the data accuracy.
In the above process, since we are using a normal Java program to do all of this, we are facing the problems below.
Processing this huge amount of data takes ages, because the input file contains tens of millions of records and we need to apply rules to each row to extract the data. It takes days to complete the job. At the same time, Hive contains the same amount of data, and the query takes too much time to return the data from Hive.
Since we are keeping the data in buffers, we are facing memory issues.
Now we are trying to implement the same thing in Hadoop MapReduce.
What is the best way to achieve the above scenario?
What are the best ways to implement it in MapReduce?
How can I increase the application performance by using MapReduce?
8 GB is a tiny data set. I can fit four of these 'data sets' into my laptop's RAM! Just dump it into any relational engine and massage it as you see fit until the cows come home. This is not 'big data'.
For the record, the way to process two truly large datasets (say 1+ TB each) in Hive is a sort-merge-bucket join (a.k.a. SMB join). Read LanguageManual JoinOptimization and watch Join Strategies in Hive.
