Spark performance analysis for joins

Input data
I have two tables exported from MySQL as CSV files.
Table 1 size on disk: 250 MB
Records: 0.7 million
Table 2 size on disk: 350 MB
Records: 0.6 million
Update: the code
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read both CSV exports with the Databricks CSV reader and register them as temp tables.
val tableOne = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("example-input-files/table-1-data.csv")
tableOne.registerTempTable("table_one")
val tableTwo = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("example-input-files/table-2-data.csv")
tableTwo.registerTempTable("table_two")

// Cache both tables in memory before running the join.
sqlContext.cacheTable("table_one")
sqlContext.cacheTable("table_two")

// Left join on the common ID column, printing only the first two rows.
val result = sqlContext.sql("SELECT table_one.ID, table_two.ID FROM table_one LEFT JOIN table_two ON table_one.ID = table_two.ID")
result.take(2).foreach(println)
The Spark Job
Read the two CSV files using the Databricks CSV library and register them as tables.
Perform a left join on both using a common column, a typical left join in relational DB speak.
Print only the top two results, since printing everything to the console would itself consume time.
This takes 30 seconds in all. I am running on a single machine with enough memory for both files to fit in (it is only 600 MB after all).
There were two ways that I ran the job.
Run the job as a whole, i.e. load the CSV files, run the join and then print the results.
First cache the tables in memory using sqlContext.cacheTable("the_table"), then run the join.
After caching, I found that the join operation itself took 8 seconds to complete.
Is this time reasonable? I am guessing it is not, and that there are a lot of optimisations that can be done to speed up the query.
Optimizations that I see
Putting the data into HDFS instead of on the local disk. Will this speed up retrieval?
Running on a cluster: I am guessing this will not be faster, since the data fits into memory and sequential processing will be faster.
Will modelling the data and using Cassandra be faster?
I am using plain SQL to join; will an RDD join be faster?
Is there any other way to do things better?

As mentioned by the commenters, Spark is designed for distributed computing. The overhead of initialization and scheduling alone is enough to make Spark seem slow compared to other tools when working locally on small(ish) data.
Running on a cluster: I am guessing this will not be faster, since the data fits into memory and sequential processing will be faster.
The executors will actually work on their local copy of the data in memory for as long as your code performs narrow transformations, so this is not exactly correct. Your code performs a join, however, which is a wide transformation, meaning the blocks will have to be shuffled across the network. Keep this in mind: wide transformations are expensive, so as much as possible put them at the end of a DAG. But again, your data is small enough that you might not see the benefits.
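One way to see the shuffle this refers to is to inspect the physical plan. Below is a minimal PySpark sketch (the Scala DataFrame API behaves the same way): the file paths and the ID column come from the question, everything else is illustrative, and the broadcast hint at the end is an extra option rather than part of this answer.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-plan-demo").getOrCreate()

t1 = spark.read.option("header", "true").csv("example-input-files/table-1-data.csv")
t2 = spark.read.option("header", "true").csv("example-input-files/table-2-data.csv")

# The join is a wide transformation: the plan will typically show Exchange (shuffle) nodes.
t1.join(t2, "ID", "left").explain()

# If one side is small enough to ship to every executor, a broadcast hint
# turns this into a map-side join with no shuffle of the larger side.
t1.join(broadcast(t2), "ID", "left").explain()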
Another thing is that if you have Hive then you could consider storing the data in a table partitioned on your join column.

Related

How to improve performance of csv to parquet file format using pyspark?

I have a large dataset I need to convert from csv to parquet format using pyspark. There is approximately 500GB of data scattered across thousands of csv files. My initial implementation is simplistic ...
spark = SparkSession.builder \
.master("local") \
.appName("test") \
.getOrCreate()
df = spark.read.csv(input_files, header=True, inferSchema=True)
df.repartition(1).write.mode('overwrite').parquet(output_dir)
The performance is abysmal; I have let it run for 2+ hours before giving up. From the logging output I infer that it does not even complete reading the CSV files into the dataframe.
I am running spark locally on a server with 128 high performance CPU cores and 1TB of memory. Disk storage is SSD based with confirmed read speeds of 650 MB/s. My intuition is that I should be able to significantly improve performance given the computing resources available. I'm looking for tips on how to do this.
I have tried...
not inferring the schema; this did not produce a noticeable difference in performance (the schema is four columns of text)
using the configuration setting spark.executor.cores to match the number of physical cores on my server. The setting did not seem to have any effect; I did not observe the system using more cores.
I'm stuck using pyspark for now per management direction, but if necessary I can convince them to use a different tool.
Some suggestions based on my experience working with Spark:
You should not infer the schema if you are dealing with huge data; it might not show a significant improvement in performance, but it will still save you some time (see the sketch below).
Don't use repartition(1), as it shuffles the data into a single partition, which is exactly what you don't want with the volume of data you have. I would suggest increasing the number of partitions if possible, based on your cluster configuration, to get the Parquet files saved faster.
Don't cache/persist your data frame if you are just reading the CSV files and then, in the next step, saving them as Parquet files. It can increase the saving time, as caching itself takes some time. Caching the data frame would help if you were performing multiple transformations and then multiple actions on it; you are performing only one action (writing the data frame as Parquet files), so in my view you should not cache it.
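A minimal sketch of the first two suggestions, assuming a local SparkSession like the one in the question; the column names are placeholders (the post only says the schema is four text columns), and input_files / output_dir are the variables from the original snippet.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# local[*] uses all cores; the original master("local") runs everything on a single thread.
spark = SparkSession.builder.master("local[*]").appName("csv-to-parquet").getOrCreate()

# Explicit schema instead of inferSchema=True; placeholder names for the four text columns.
schema = StructType([
    StructField("col_a", StringType(), True),
    StructField("col_b", StringType(), True),
    StructField("col_c", StringType(), True),
    StructField("col_d", StringType(), True),
])

df = spark.read.csv(input_files, header=True, schema=schema)

# No repartition(1): keep many partitions so the write runs in parallel.
df.write.mode("overwrite").parquet(output_dir)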
Some possible improvements:
Don't use .repartition(1), as you lose parallelism for the write operation.
Persist/cache the dataframe before writing: df.persist()
If you really need to save it as one Parquet file, you can first write it into a temp folder without reducing partitions, then use coalesce in a second write operation:
df = spark.read.csv(input_files, header=True, inferSchema=True).persist()
# ....
df.write.mode('overwrite').parquet("/temp/folder")
df.unpersist()
df1 = spark.read.parquet("/temp/folder")
df1.coalesce(1).write.mode('overwrite').parquet(output_dir)

Excessive memory usage when using dask dataframe created from parquet file

I have a parquet file that is 800K rows x 8.7K columns. I loaded it into a dask dataframe:
import dask.dataframe as dd
dask_train_df = dd.read_parquet('train.parquet')
dask_train_df.info()
This yields:
<class 'dask.dataframe.core.DataFrame'>
Columns: 8712 entries, 0 to 8711
dtypes: int8(8712)
When I try to do simple operations like dask_train_df.head() or dask_train_df.loc[2:4].compute(), I get memory errors, even with 17+ GB of RAM.
However, if I do:
import pandas as pd
train = pd.read_parquet('../input/train.parquet')
train.info()
yields:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Columns: 8712 entries, 0 to 8711
dtypes: int8(8712)
memory usage: 6.5 GB
and I can run train.head() and train.loc[2:4] with no problems since everything is in memory already.
1) So my question is: why do these simple operations blow up the memory usage with a Dask DataFrame, but work fine when I load everything into memory with a Pandas DataFrame?
I notice that npartitions=1, and I see in the documentation that read_parquet "reads a directory of Parquet data into a Dask.dataframe, one file per partition". In my case, it sounds like I'm losing out on all of the parallelization power of having multiple partitions, but then shouldn't the Dask DataFrame memory usage be capped by the amount of memory of the single Pandas DataFrame?
2) Also, a side question: if I wanted to parallelize this single parquet file by partitioning it in a Dask DataFrame, how would I do so? I don't see a blocksize parameter in the dd.read_parquet signature. I also tried using the repartition function, but I believe that partitions along the rows, and for a parquet file I would want to partition along the columns?
First, I would like to comment that 8712 columns is rather many, and you will find that parsing the schema/metadata may take significant time, never mind the data loading.
When fastparquet loads data, it first allocates a dataframe of sufficient size, then iterates through the columns/chunks (with appropriate overheads, which apparently are small in this case) and assigns values into the allocated dataframe.
When you run a calculation through Dask (any calculation), there can in many cases be intra-task copies in memory of the input variables and other intermediate objects. That is usually not an issue, as the whole data-set should be split into many parts, and the small intermediates' memory overhead is a price worth paying for being able to handle datasets larger than memory. I am not sure at which point you are getting a copy, it may be worth investigating and preventing.
In your case, the whole data-set is a single partition. This will result in a single load task, running in one thread. You will not be getting any parallelism, and any intermediate internal copies apply to the whole dataset. You could load only part of the data by selecting columns, and so manufacture partitions and achieve parallelism that way. However, the typical way to handle parquet data is to make use of "row-group" partitions (i.e., along the index) and multiple files, so the real way to avoid the problem is to use data which is already appropriately partitioned.
Note that since you can load the data directly with fastparquet/pandas, you could probably also save a partitioned version either with the to_parquet method or fastparquet's write function.
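A minimal sketch of that last suggestion, assuming the frame fits in memory as shown above; the output path and the npartitions value are illustrative.

import pandas as pd
import dask.dataframe as dd

# Load once with pandas (about 6.5 GB in memory), then write a partitioned copy.
train = pd.read_parquet('../input/train.parquet')
ddf = dd.from_pandas(train, npartitions=16)   # splits along the row index
ddf.to_parquet('train_partitioned.parquet')   # writes one file per partition

# Reading the partitioned copy back now gives multiple partitions instead of one.
dask_train_df = dd.read_parquet('train_partitioned.parquet')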

How to compare two large data sets using hadoop mapreduce?

I am new to Hadoop and MapReduce. We have a normal Java application where we read a file (8 GB in size) from the Hadoop file system and apply some rules to that data. After applying the rules we get a Java HashMap (which is huge in size) and keep that data in a cache or buffer. At the same time we get data from Hive by running a query on it and prepare another Java HashMap, which is again huge. Now we compare the data in both HashMaps to prepare a final report and check the data accuracy.
In the above process, since we are using a normal Java program to do this, we face the problems below.
Processing this huge data takes ages, because the input file contains tens of millions of records and we need to apply rules to each row to extract the data. It takes days to complete the job. At the same time, Hive contains the same amount of data and the query takes too long to return the data.
Since we are keeping the data in a buffer, we are facing memory issues.
Now we are trying to implement the same thing in Hadoop MapReduce.
What is the best way to achieve the above scenario?
What are the best ways to implement the above scenario in mapreduce?
How can I increase the application performance by using mapreduce?
8 GB is a tiny data set. I can fit 4 of these 'data sets' into my laptop RAM! Just dump it into any relational engine and massage it as you see fit until the cows come home. This is not 'big data'.
For the record, the way to process two truly large datasets (say 1+ TB each) in Hive is a sort-merge-bucket join (aka SMB join). Read LanguageManual JoinOptimization and watch Join Strategies in Hive.
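The linked Hive pages cover the SMB setup itself; since most of this thread leans on Spark, here is a rough Spark analogue of the same idea, with illustrative table names, paths, join key and bucket count: writing both sides bucketed and sorted on the join key lets the sort-merge join skip the shuffle and sort.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smb-analogue").enableHiveSupport().getOrCreate()

left = spark.read.parquet("/data/left")
right = spark.read.parquet("/data/right")

# Write both sides bucketed and sorted on the join key (bucket counts must match).
left.write.bucketBy(256, "key").sortBy("key").mode("overwrite").saveAsTable("left_bucketed")
right.write.bucketBy(256, "key").sortBy("key").mode("overwrite").saveAsTable("right_bucketed")

# The join on the bucketed tables can then avoid the shuffle and sort steps.
spark.sql("SELECT * FROM left_bucketed l JOIN right_bucketed r ON l.key = r.key").explain()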

pig skewed join with a big table causes "Split metadata size exceeded 10000000"

We have a pig join between a small (16M rows) distinct table and a big (6B rows) skewed table.
A regular join finishes in 2 hours (after some tweaking). We tried using 'skewed' and were able to improve the performance to 20 minutes.
HOWEVER, when we try a bigger skewed table (19B rows), we get this message from the SAMPLER job:
Split metadata size exceeded 10000000. Aborting job job_201305151351_21573 [ScriptRunner]
at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:817) [ScriptRunner]
This is reproducible every time we try using skewed, and does not happen when we use the regular join.
We tried setting mapreduce.jobtracker.split.metainfo.maxsize=-1 and we can see it is there in the job.xml file, but it doesn't change anything!
What's happening here? Is this a bug with the distribution sample created by using skewed? Why doesn't it help changing the param to -1?
If the small table is small enough to fit into memory, try a replicated join.
A replicated join is map-only; it does not have a reduce stage like other join types and is therefore immune to skew in the join keys. It should be quick.
big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
Big table is always the first one in the statement.
UPDATE 1:
If the small table in its original form does not fit into memory, then as a workaround you would need to split your small table into partitions that are small enough to fit into memory, and then apply the same partitioning to the big table. Ideally you could add the same partitioning algorithm to the system which creates the big table, so that you do not waste time repartitioning it.
After partitioning, you can use a replicated join, but it will require running the Pig script for each partition separately.
In newer versions of Hadoop (>=2.4.0, but maybe even earlier) you should be able to set the maximum split metadata size at the job level by using the following configuration property:
mapreduce.job.split.metainfo.maxsize=-1

Join performance on AWS elastic map reduce running hive

I am running a simple join query
select count(*) from t1 join t2 on t1.sno=t2.sno
Tables t1 and t2 each have 20 million records, and the column sno is of string data type.
The table data is imported into HDFS from Amazon S3 in RCFile format.
The query took 109 seconds with 15 Amazon large instances; however, it takes 42 seconds on SQL Server with 16 GB RAM and 16 CPU cores.
Am I missing anything? I can't understand why I am getting slow performance on Amazon.
Some questions to help you tune Hadoop performance:
What does your IO utilization look like on those instances? Maybe large instances are not the right balance of CPU / Disk / Memory for the job.
How are your files stored? Is it a single file, or many small files? Hadoop isn't so hot with many small files, even if they're combinable
How many reducers did you run? You want to have about 0.9*totalReduceCapacity as ideal
How skewed is your data? If there are many records with the same key they will all go to the same reducer, and you'll have O(n*n) upper bound in that reducer if you're not careful.
SQL Server might be fine with 40 million records, but wait till you have 2 billion records and see how it does. It will probably just break. I'd see Hive more as a clever wrapper for MapReduce than as an alternative to a real database.
Also, from experience, I think 15 c1.medium instances might perform just as well as the large machines, if not better; the large machines honestly don't have the right balance of CPU/memory.
