Best way to read a large CSV (10 GB) and, after computation, store the data in a DB - performance

I have a text file of roughly 4 GB (5-6 million lines) that I parse and save into a database. This process takes almost 3-4 hours to parse and store the data, and it is an everyday job.
Now, when I query the database, it takes too long to compute results and return them: a simple AVG or SUM for a particular day takes 30-40 minutes.
I am using Python and MySQL right now. I also tried Spark for the computation, which also took 30-40 minutes, and the data keeps growing, so the file will soon be around 10 GB; I have not been able to get Spark to handle files that large.
Please suggest how I can improve the parsing time, the time to store into the database, and the fetch time.
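One common way to speed up the parse-and-store step in plain Python is to stream the file in chunks and bulk-insert each chunk, instead of inserting rows one at a time. A minimal sketch with pandas and SQLAlchemy follows; the connection string, table name, and column names are placeholders, not details from the post:
# Hypothetical sketch: chunked CSV load into MySQL with pandas + SQLAlchemy (requires pymysql).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")  # placeholder DSN

# Stream the large file instead of reading it all into memory at once.
chunks = pd.read_csv("big_input.csv", chunksize=100_000)

for chunk in chunks:
    # ... apply the per-row rules / computations on the chunk here ...
    chunk.to_sql(
        "daily_measurements",   # placeholder table name
        engine,
        if_exists="append",
        index=False,
        method="multi",         # multi-row INSERT statements instead of one row per INSERT
        chunksize=10_000,
    )
Bulk, multi-row inserts like this are usually much faster than committing one row at a time; the right chunk size depends on available memory and the MySQL max_allowed_packet setting.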

I do not know what database you are using, but maybe you could switch?
I suggest using Impala with an AVRO schema. You will probably need to create/refresh the table using Hive, as Impala lacks some functionality in the administrative area.
I have used this setup with the files stored on HDFS: grouping and then summing 45 GB of floats took me about 40 seconds on 4 machines. You spend no time loading anything into a database, because the files themselves are the source. All you need is the time to put the files into HDFS, and that is about as fast as any file system copy.
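To make the "the files themselves are the source" idea concrete, here is a rough sketch of that flow from Python using the impyla client. The host, port, database, table, and HDFS paths are all placeholders, and as noted above the DDL may need to be issued through Hive on some versions:
# Hypothetical sketch: query Avro files sitting on HDFS through Impala via impyla.
from impala.dbapi import connect

conn = connect(host="impala-daemon-host", port=21050)  # placeholder host, default Impala port
cur = conn.cursor()

# External table over the Avro files already in HDFS -- nothing is copied into a database.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS mydb.daily_events
    STORED AS AVRO
    LOCATION '/data/daily_events'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/daily_events.avsc')
""")

# Pick up newly landed files, then aggregate directly over them.
cur.execute("REFRESH mydb.daily_events")
cur.execute("""
    SELECT event_date, AVG(value) AS avg_value, SUM(value) AS sum_value
    FROM mydb.daily_events
    GROUP BY event_date
""")
for row in cur.fetchall():
    print(row)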

Related

How to improve performance of csv to parquet file format using pyspark?

I have a large dataset I need to convert from csv to parquet format using pyspark. There is approximately 500GB of data scattered across thousands of csv files. My initial implementation is simplistic ...
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("test") \
    .getOrCreate()

df = spark.read.csv(input_files, header=True, inferSchema=True)
df.repartition(1).write.mode('overwrite').parquet(output_dir)
The performance is abysmal: I have let it run for 2+ hours before giving up. From the logging output I infer that it does not even finish reading the csv files into the dataframe.
I am running spark locally on a server with 128 high performance CPU cores and 1TB of memory. Disk storage is SSD based with confirmed read speeds of 650 MB/s. My intuition is that I should be able to significantly improve performance given the computing resources available. I'm looking for tips on how to do this.
I have tried...
not inferring the schema; this did not produce a noticeable difference in performance (the schema is four columns of text)
setting spark.executor.cores to match the number of physical cores on my server; the setting did not seem to have any effect, and I did not observe the system using more cores
I'm stuck using pyspark for now per management direction, but if necessary I can convince them to use a different tool.
Some suggestions based on my experience working with Spark:
You should not infer the schema when you are dealing with huge data. It might not show a dramatic improvement in performance, but it will definitely still save you some time.
Don't use repartition(1): it shuffles all the data into a single partition, which is exactly what you don't want with the volume of data you have. I would suggest increasing the number of partitions, if your cluster configuration allows it, so that the parquet files are written faster.
Don't cache/persist your data frame if you are just reading the csv files and then saving them as parquet files in the next step. Caching itself takes some time, so it can increase your total save time. Caching the data frame would have helped if you were performing multiple transformations on it and then performing multiple actions; here you perform only one action (writing the data frame as a parquet file), so in my opinion you should not cache it.
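A minimal sketch of the first two points (an explicit schema instead of inferSchema, and no repartition(1)); the column names are placeholders based on the question's description of "four columns of text":
# Hypothetical sketch: read with an explicit schema and write without collapsing partitions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# local[*] lets a single-machine run use all available cores.
spark = SparkSession.builder.master("local[*]").appName("csv_to_parquet").getOrCreate()

# Declaring the schema up front saves Spark a pass over the files to infer types.
schema = StructType([
    StructField("col_a", StringType(), True),
    StructField("col_b", StringType(), True),
    StructField("col_c", StringType(), True),
    StructField("col_d", StringType(), True),
])

df = spark.read.csv(input_files, header=True, schema=schema)

# No repartition(1): every task writes its own parquet part file in parallel.
df.write.mode("overwrite").parquet(output_dir)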
Some possible improvements:
Don't use .repartition(1), as you lose all parallelism for the write operation.
Persist/cache the dataframe before writing: df.persist()
If you really need to save it as a single parquet file, you can first write into a temp folder without reducing the number of partitions, then use coalesce in a second write operation:
df = spark.read.csv(input_files, header=True, inferSchema=True).persist()
# ....
df.write.mode('overwrite').parquet("/temp/folder")
df.unpersist()
df1 = spark.read.parquet("/temp/folder")
df1.coalesce(1).write.mode('overwrite').parquet(output_dir)

Hadoop Performance When retrieving Data Only

We know that the performance of Hadoop may be increased by adding more data nodes. My question is: if we only want to retrieve the data, without processing or analyzing it, will adding more data nodes be useful? Or will it not increase performance at all, because we have only retrieve operations and no computations or MapReduce jobs?
I will try to answer in parts:
If you only retrieve information from a Hadoop cluster or HDFS, then it is similar to the cat command in Linux: you are only reading data, not processing it.
If you want calculations such as SUM, AVG, or any other aggregate functions on top of your data, then the REDUCE concept, and hence MapReduce, comes into the picture.
So Hadoop is useful, or worth it, when your data is huge and you also run calculations on it. I think there is no performance benefit to reading a small amount of data from HDFS versus a large amount of data from HDFS (think of it as storing your data in an RDBMS and only running SELECT * statements on a daily basis); but when your data grows exponentially and you want to run calculations, the RDBMS query will take a long time to execute.
For MapReduce to work efficiently on huge data sets, you need a good number of nodes and enough computing power, depending on your use case.
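To illustrate the REDUCE step mentioned above, here is a rough sketch of a Hadoop Streaming job in Python that sums a numeric column per key; the column positions, file names, and the sum-per-key task are assumptions for illustration:
# mapper.py -- hypothetical sketch: emit key<TAB>value pairs, one per input line
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) < 2:
        continue
    # assume column 0 is the grouping key and column 1 is the numeric value to sum
    print(f"{fields[0]}\t{fields[1]}")

# reducer.py -- hypothetical sketch: sum values per key; Hadoop delivers mapper output sorted by key
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
The two scripts are submitted with the hadoop-streaming jar, roughly: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out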

To hadoop or not to hadoop

We have data (not a lot at this point) that we want to transform/aggregate/pivot up the wazoo.
I had a look around the web, and all the answers to what I am asking point to Hadoop: scalable, cheap to run (no SQL Server machine and license), fast (if you have a lot of data), and programmable (not little boxes that you drag around).
There is just one problem that I keep coming up against,
namely 'Use hadoop if you have more than 10gb of data'.
Now we don't even have 1 GB of data (at this stage); is it still viable?
My other option is SSIS. We do use SSIS for some of our current ETL, but we don't have the resources for it, and putting SQL Server in the cloud is just going to cost too much; don't even get me started on scalability cost and configuration.
thanks
Your current data volume seems too low to justify moving into Hadoop. Enter the Hadoop ecosystem only if you are dealing with huge volumes of data (terabytes per year) and you suspect the volume will increase exponentially down the line.
Let me explain why I advise against Hadoop for such a low data volume.
By default Hadoop stores your files in 128 MB chunks, and during processing it also takes 128 MB chunks at a time (in parallel). If your business requirement involves heavy, CPU-intensive processing, you can decrease the input chunk size below 128 MB. But by decreasing the amount of data processed in parallel, you end up increasing the number of IO seeks (low-level block storage). In the end you may spend more resources on managing the tasks than on the actual work itself. Hence, try to avoid distributed computing as a solution for your (low) data volume.
As #Makubex has suggested, don't use hadoop.
And SSIS is a good option, as it handles the data in memory: it performs data aggregations, data type conversions, merging, etc. at a much faster rate than writing to disk using temporary tables in stored procedures.
Hadoop is meant for large amounts of data; I would suggest it only for data in the terabytes. It would be way slower than SSIS (which runs in memory) for small data sets.
Refer: When to use T-SQL or SSIS for ETL

How to compare two large data sets using hadoop mapreduce?

I am new to hadoop and mapreduce. We have a normal Java application in which we read a file (8 GB in size) from the Hadoop file system and apply some rules to that data. After applying the rules we get a Java HashMap (which is huge), and we keep that data in a cache or buffer. At the same time we get data from Hive by running a query against it and prepare another Java HashMap, which is again huge. Now we compare the two HashMaps to prepare a final report that checks the data accuracy.
In the above process, since we are using a normal Java program to do all of this, we are facing the problems below.
Processing this huge data takes ages to complete. The input file contains tens of millions of records, and we need to apply rules to each row to extract the data, so the job takes days to complete. At the same time Hive holds the same amount of data, and the query takes too long to return the data from Hive.
Since we are keeping the data in a buffer, we are facing memory issues.
Now we are trying to implement the same thing in Hadoop MapReduce.
What is the best way to achieve the above scenario?
What are the best ways to implement the above scenario in mapreduce?
How can I increase the application performance by using mapreduce?
8 GB is a tiny data set. I can fit 4 of these 'data sets' into my laptop's RAM! Just dump it into any relational engine and massage it as you see fit until the cows come home. This is not 'big data'.
For the record, the way to process two truly large datasets (say 1+ TB each) in Hive is a sort-merge-bucket join (aka SMB join). Read LanguageManual JoinOptimization and watch Join Strategies in Hive.
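For the comparison step itself, the two in-memory HashMaps can be replaced by a distributed join. Here is a rough sketch in PySpark rather than raw MapReduce or the Hive SMB join mentioned above; the paths, table name, key, and column names are all placeholders:
# Hypothetical sketch: compare file-derived records against a Hive table with a distributed join.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("accuracy_check").enableHiveSupport().getOrCreate()

file_df = spark.read.csv("hdfs:///data/input_file.csv", header=True)  # the 8 GB input file
hive_df = spark.table("reporting.reference_table")                    # the Hive side

# A full outer join on the record key keeps rows that exist on only one side.
joined = file_df.alias("f").join(
    hive_df.alias("h"),
    F.col("f.record_id") == F.col("h.record_id"),
    "full_outer",
)

report = joined.withColumn(
    "status",
    F.when(F.col("f.record_id").isNull(), "missing_in_file")
     .when(F.col("h.record_id").isNull(), "missing_in_hive")
     .when(~F.col("f.amount").eqNullSafe(F.col("h.amount")), "mismatch")
     .otherwise("match"),
)

report.write.mode("overwrite").parquet("hdfs:///reports/accuracy_check")
Because the join is distributed and spills to disk when it does not fit in memory, it avoids the out-of-memory problems of holding both sides in HashMaps.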

Is Hadoop Suited to Serve 100 byte Records Out of 50GB Dataset?

We have a question about whether Hadoop is suitable for simple tasks that require no application logic, but do require very fast reads and writes of small amounts of data.
The requirement is to be able to write messages roughly 100-200 bytes long, with a couple of indexes, at a rate of 30 per second, and at the same time to be able to read (search by those two indexes) at a rate of roughly 10 per second. The read queries must be very fast - 100-200 milliseconds max per query - and return a few matching records.
The total data volume is expected to reach 50-100 GB and is to be kept at that level by removing older records (something like a daily task to delete records that are older than 14 days).
As you can see, the total data volume is not really that big, but we are concerned that Hadoop's search speed may be slower than what we need anyway.
Is Hadoop a solution for this?
Thanks
Nik
Hadoop, alone, is very bad at serving out many small segments of data. However, HBase is an indexed, table-oriented, database-like system meant to be run on top of Hadoop, and it is excellent at serving out small indexed records. I would research that as a solution.
Another problem to keep an eye on is that importing data into HDFS or HBase is not trivial. It can slow your cluster down quite a bit, so if Hadoop is your choice, you also have to solve how to get those 75GB into HDFS so Hadoop can touch them.
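A rough sketch of what that write/read pattern can look like from Python, using the happybase client against HBase's Thrift server; the host, table name, column family, and row-key layout are assumptions (in HBase the "indexes" are typically designed into the row key):
# Hypothetical sketch: small writes and indexed reads against HBase via happybase.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # placeholder host
table = connection.table("messages")                    # placeholder table

# Write a ~100-200 byte message; the row key encodes the two search "indexes".
row_key = b"deviceA|2024-01-15T10:00:00"
table.put(row_key, {b"d:payload": b"...message bytes...", b"d:status": b"ok"})

# Point read by exact key -- the fast path HBase is built for.
print(table.row(row_key))

# Prefix scan over one index value, fast because row keys are stored sorted.
for key, data in table.scan(row_prefix=b"deviceA|"):
    print(key, data)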
As Sam noted, HBase is the Hadoop-stack solution that can handle your requirements. However, I wouldn't go with Hadoop if these are your only requirements for the data.
You can go with other NoSQL solutions like MongoDB or CouchDB, or even MySQL or Postgres.
