I'm trying to store/persist kdb tables in compressed Apache Parquet format.
My initial plan is basically to use embedPy to make either fastparquet or pyarrow.parquet usable from within q.
I'll then use the kdb+ tick architecture to process incoming ticks and run daily batched Parquet writes to disk.
Would this be a good idea? Otherwise, what would be the best method of persisting large amounts of data to disk? Thanks.
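For reference, on the Python side a daily batched write can be as small as the sketch below, assuming pyarrow is the library exposed through embedPy; the column names, sample values and output path are made up for illustration.

import pyarrow as pa
import pyarrow.parquet as pq

def write_daily_batch(columns, path):
    # columns: dict of column name -> list of values, as handed over from q via embedPy
    table = pa.Table.from_pydict(columns)
    # Snappy keeps writes fast; zstd or gzip trade CPU for smaller files
    pq.write_table(table, path, compression="snappy")

write_daily_batch(
    {"time": [1, 2, 3], "sym": ["a", "b", "a"], "price": [10.0, 10.5, 10.1]},
    "trades_2020.01.01.parquet",
)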
I have a large dataset I need to convert from csv to parquet format using pyspark. There is approximately 500GB of data scattered across thousands of csv files. My initial implementation is simplistic ...
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("test") \
    .getOrCreate()
df = spark.read.csv(input_files, header=True, inferSchema=True)
df.repartition(1).write.mode('overwrite').parquet(output_dir)
The performance is abysmal; I have let it run for 2+ hours before giving up. From the logging output I infer it does not even complete reading the CSV files into the dataframe.
I am running Spark locally on a server with 128 high-performance CPU cores and 1 TB of memory. Disk storage is SSD-based, with confirmed read speeds of 650 MB/s. My intuition is that I should be able to significantly improve performance given the computing resources available. I'm looking for tips on how to do this.
I have tried...
not inferring the schema; this did not produce a noticeable difference in performance (the schema is four columns of text)
setting spark.executor.cores to match the number of physical cores on my server; the setting did not seem to have any effect, and I did not observe the system using more cores.
I'm stuck using pyspark for now per management direction, but if necessary I can convince them to use a different tool.
Some suggestions based on my experience working with Spark (a minimal sketch follows these points):
You should not infer the schema if you are dealing with huge data. It might not show a significant improvement in performance, but it will definitely still save you some time.
Don't use repartition(1), as it shuffles the data and collects it into a single partition, which is exactly what you don't want with the volume of data you have. I would suggest increasing the number of partitions if possible, based on your cluster configuration, so the Parquet files get written faster.
Don't cache/persist your data frame if you are just reading the CSV files and then saving them as Parquet files in the next step. It can increase your saving time, as caching itself takes some time. Caching the data frame would help if you were performing multiple transformations and then multiple actions on it; you are performing only one action (writing the data frame as Parquet files), so in my view you should not cache it.
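Putting those suggestions together, a minimal sketch could look like the following; the explicit schema assumes the four text columns mentioned in the question (column names are placeholders), and master("local[*]") is used because plain "local" runs a single worker thread:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("csv_to_parquet") \
    .getOrCreate()

# Explicit schema instead of inferSchema=True (column names are placeholders)
schema = StructType([
    StructField("col_a", StringType(), True),
    StructField("col_b", StringType(), True),
    StructField("col_c", StringType(), True),
    StructField("col_d", StringType(), True),
])

df = spark.read.csv(input_files, header=True, schema=schema)

# No repartition(1) and no caching: let Spark write many partitions in parallel
df.write.mode("overwrite").parquet(output_dir)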
Some possible improvements:
Don't use .repartition(1), as you lose parallelism for the write operation.
Persist/cache the dataframe before writing: df.persist()
If you really need to save it as one Parquet file, you can first write into a temp folder without reducing partitions, then use coalesce in a second write operation:
df = spark.read.csv(input_files, header=True, inferSchema=True).persist()
# ....
df.write.mode('overwrite').parquet("/temp/folder")
df.unpersist()
df1 = spark.read.parquet("/temp/folder")
df1.coalesce(1).write.mode('overwrite').parquet(output_dir)
I have real-time time series sensor data. My primary goal is to keep the raw data, and I need to do that while keeping the cost of storage minimal.
My scenario is like this:
All sensors produce time series data, and I must save this raw time series data for batch analysis. The Parquet format is great for reducing storage cost. But does it make sense to write each incoming time series record as a separate Parquet file?
On the other hand, I want to process each incoming time series record in real time. For the real-time scenario I can use Kafka. But can HBase or Cassandra be used for both batch and real-time analysis instead of Kafka?
If I use Cassandra, how can I do batch analysis?
But can HBase or Cassandra be used for both batch and real-time analysis instead of Kafka?
Think of Kafka as a pipe into these stores. It is not a replacement for either of them. HBase and Cassandra are stores, and you need to "batch" the data out of them... You would use Kafka Streams (or Spark, Flink, or my personal favorite, NiFi) for actual (near) real-time processing before these systems.
I would suggest using Kafka rather than having point-to-point metric feeds into Hadoop (or related tools). I would also encourage using something meant for such data, like TimescaleDB, CrateDB or InfluxDB, or maybe Prometheus with some modification to the infrastructure... You could use Kafka for ingestion into both Hadoop and these other tools that are better tuned to store such datasets (which is the benefit of "buffering" the data in Kafka first).
does it make sense to write each incoming time series record as a separate Parquet file?
Sure, if you want to store lots of data for large batch analysis. But if you window your stream into hourly data points and compute sums and averages, for example, then do you really need to store each and every data point?
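For illustration only, that kind of windowed aggregation could look like the sketch below, here using Spark Structured Streaming reading from Kafka; the broker address, topic name, JSON layout and field names are assumptions, not anything from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("sensor_windows").getOrCreate()

# Assumed JSON payload: {"sensor_id": "...", "ts": "...", "value": 1.23}
payload = StructType([
    StructField("sensor_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("value", DoubleType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
       .option("subscribe", "sensor-readings")            # placeholder topic
       .load())

events = raw.select(F.from_json(F.col("value").cast("string"), payload).alias("e")).select("e.*")

# Hourly sums and averages per sensor instead of keeping every raw point
hourly = (events
          .withWatermark("ts", "2 hours")
          .groupBy(F.window("ts", "1 hour"), "sensor_id")
          .agg(F.avg("value").alias("avg_value"), F.sum("value").alias("sum_value")))

(hourly.writeStream.outputMode("append")
       .format("parquet")
       .option("path", "/data/sensor-hourly")              # placeholder paths
       .option("checkpointLocation", "/chk/sensor-hourly")
       .start())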
If I use Cassandra, how can I do batch analysis?
Well, I would hope the same way you currently do it. Schedule a query against a database? Hope all the data is there? (no late-arriving records)
I have real-time time series sensor data. My primary goal is to keep the raw data, and I need to do that while keeping the cost of storage minimal.
If your requirement is storing the raw data, you can write it to HDFS in compressed form. Using the Parquet format might not be ideal here; formats can change.
If you have the incoming data in Kafka, you can use Kafka Connect to write it to HDFS in batches from a topic.
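As a rough illustration (not part of the answer above), registering such a sink is typically a small REST call against Kafka Connect; the connector class and config keys below are assumptions based on the Confluent HDFS sink connector and should be checked against the connector you actually deploy:

import json
import requests  # assumes the requests library is available

# Hypothetical connector config; verify the key names against the connector's docs
connector = {
    "name": "sensor-hdfs-sink",
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "sensor-readings",            # placeholder topic
        "hdfs.url": "hdfs://namenode:8020",     # placeholder cluster address
        "flush.size": "100000",                 # records per file, i.e. the batch size
        "tasks.max": "4",
    },
}

# Kafka Connect's REST API usually listens on port 8083
resp = requests.post("http://connect:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()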
All sensors produce time series data, and I must save this raw time series data for batch analysis. The Parquet format is great for reducing storage cost. But does it make sense to write each incoming time series record as a separate Parquet file?
Not sure if I understand correctly, but it does not make any sense to store each data point in a separate Parquet file.
The Parquet format has overhead compared to raw data.
Parquet is specifically designed for table-like data with many rows, so that filtering on that data is fast (with local access).
Batch processing and filesystems are, most of the time, really unhappy about lots of small files.
On the other hand, I want to process each incoming time series record in real time. For the real-time scenario I can use Kafka. But can HBase or Cassandra be used for both batch and real-time analysis instead of Kafka?
Depending on your use case, it might be easy enough to use Hive or Spark SQL on the raw data for batch processing.
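For example, batch analysis over raw files on HDFS can be as simple as the sketch below; the path, schema and aggregation are placeholders, not anything prescribed here:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sensor_batch").getOrCreate()

# Raw compressed JSON files on HDFS; path and field names are assumptions
raw = spark.read.json("hdfs://namenode:8020/data/sensor-raw/*.json.gz")

# Daily average per sensor as an example batch query
daily = (raw.groupBy(F.to_date("ts").alias("day"), "sensor_id")
            .agg(F.avg("value").alias("avg_value")))

daily.show()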
Maybe a Kafka Streams processor is enough for your real-time requirements.
So many options. It all depends on the use cases...
I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all the columns' data!
Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain it to me in simple terms?
Avro is a row-based format. If you want to retrieve the data as a whole, you can use Avro.
Parquet is a column-based format. If your data consists of a lot of columns but you are interested in a subset of them, you can use Parquet.
HBase is useful when frequent updating of data is involved. Avro is fast in retrieval; Parquet is much faster.
If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g.,
job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(MyAvroType.getClassSchema());
for
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());
The Parquet format does seem to be a bit more computationally intensive on the write side (e.g., requiring RAM for buffering and CPU for ordering the data), but it should reduce I/O, storage and transfer costs, as well as make for efficient reads, especially with SQL-like queries (e.g., Hive or Spark SQL) that only address a portion of the columns.
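As a small illustration of that read-side benefit (not taken from the answer itself), a columnar reader can pull just the columns a query touches; with pyarrow, for instance, a projected read might look like this, where the file and column names are made up:

import pyarrow.parquet as pq

# Reads only the two listed columns; pages for the other columns are skipped on disk
table = pq.read_table("events.parquet", columns=["user_id", "amount"])
print(table.num_rows, table.column_names)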
In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in 1000s of Parquet columns. In turn, our row groups were really wide and shallow which meant that it took forever before we could process a small number of rows in the last column of each group.
I haven't had much chance to use Parquet for more normalized/sane data yet but I understand that if used well, it allows for significant performance improvements.
Avro
Widely used as a serialization platform
Row-based, offers a compact and fast binary format
Schema is encoded in the file, so the data can be untagged
Files support block compression and are splittable
Supports schema evolution
Parquet
Column-oriented binary file format
Uses the record shredding and assembly algorithm described in the Dremel paper
Each data file contains the values for a set of rows
Efficient in terms of disk I/O when specific columns need to be queried
From "Choosing an HDFS data storage format - Avro vs. Parquet and more"
Both Avro and Parquet are "self-describing" storage formats, meaning that both embed data, metadata information and schema when storing data in a file.
The use of either storage format depends on the use case. Three aspects constitute the basis upon which you may choose which format will be optimal in your case:
Read/write operations: Parquet is a column-based file format. It supports indexing. Because of that, it is suitable for write-once, read-intensive workloads with complex or analytical, low-latency queries. This is generally used by end users/data scientists.
Meanwhile Avro, being a row-based file format, is best used for write-intensive operations. This is generally used by data engineers. Both support serialization and compression formats, although they do so in different ways.
Tools: Parquet is a good fit for Impala. (Impala is a Massively Parallel Processing (MPP) SQL query engine which knows how to operate on data that resides in one or a few external storage engines.) Again, Parquet lends itself well to complex/interactive querying and fast (low-latency) outputs over data in HDFS. This is supported by CDH (Cloudera Distribution of Hadoop). Hadoop also supports Apache's Optimized Row Columnar (ORC) format (the selection depends on the Hadoop distribution), whereas Avro is best suited to Spark processing.
Schema evolution: Evolving a DB schema means changing the DB's structure, and therefore its data and its query processing. Both Parquet and Avro support schema evolution, but to varying degrees.
Parquet is good for 'append' operations, e.g. adding columns, but not for renaming columns unless 'read' is done by index.
Avro is better suited for appending, deleting and generally mutating columns than Parquet. Historically Avro has provided a richer set of schema evolution possibilities than Parquet, and although their schema evolution capabilities tend to blur, Avro still shines in that area, when compared to Parquet.
Your understanding is right. In fact, we ran into a similar situation during data migration in our DWH. We chose Parquet over Avro as the disk saving we got was almost double what we got with Avro. The query processing time was also much better than with Avro. But yes, our queries were based on aggregation, column-based operations, etc., hence Parquet was predictably a clear winner.
We are using Hive 0.12 from the CDH distro. You mentioned you are running into issues with Hive + Parquet; what are those? We did not encounter any.
Silver Blaze put it nicely with an example use case and described how Parquet was the best choice for him. It makes sense to consider one over the other depending on your requirements. I am putting up a brief description of other file formats too, along with a time/space complexity comparison. Hope that helps.
There are a bunch of file formats that you can use in Hive. Notable mentions are Avro, Parquet, RCFile & ORC. There are some good documents available online that you may refer to if you want to compare the performance and space utilization of these file formats. Below are some useful links.
This Blog Post
This link from MapR [They don't discuss Parquet though]
This link from Inquidia
The links above will get you going. I hope this answers your query.
Thanks!
I have a ~4 GB text file which I parse and save the data in a DB. This process takes almost 3-4 hours (5-6 million lines) to process and save the data in the DB. And this is an everyday process.
Now when I query the DB, it takes too much time to compute the result and return it. For example, a simple avg or sum operation for a particular day takes 30-40 minutes.
I am using Python and MySQL right now. I tried Spark for this computation as well, which also takes 30-40 minutes, and the data is growing, so the file size will increase to around 10 GB, and Spark does not seem able to handle such large files.
Please suggest how I can improve the parsing time, the time to store in the DB, and the fetching time.
I do not know what database you are using, but maybe you could switch?
I suggest using Impala + an Avro schema. You will probably need to refresh/create the table using Hive, as Impala lacks some functionality in the administrative area.
I've used it with files stored on HDFS, and grouping and then summing 45 GB of floats took me about 40 seconds on 4 machines. You spend no time putting anything into a database, as the source is the files themselves. All the time you need is for storing the files in HDFS, and that's as fast as any filesystem.
I am new to Hadoop and MapReduce. We have a normal Java application where we read a file (8 GB in size) from the Hadoop file system and apply some rules to that data. After applying the rules we get a Java HashMap (which is huge in size) and we keep that data in a cache or buffer. At the same time we get data from Hive by running a query on it and prepare another Java HashMap, which is again huge. Now we compare the data in both HashMaps to prepare a final report and check the data accuracy.
In the above process, since we are using a normal Java program to do the work, we are facing the problems below.
Processing this huge data takes ages to complete, because the input file contains tens of millions of records and we need to apply rules to each row to extract the data. It takes days to complete the job. At the same time Hive contains the same amount of data, and the query takes too much time to return the data from Hive.
Since we are keeping the data in a buffer, we are facing memory issues.
Now we are trying to implement the same in Hadoop MapReduce.
What is the best way to achieve the above scenario?
What are the best ways to implement the above scenario in MapReduce?
How can I increase the application performance by using MapReduce?
8 GB is a tiny data set. I can fit 4 of these 'data sets' into my laptop's RAM! Just dump it into any relational engine and massage it as you see fit until the cows come home. This is not 'big data'.
For the record, the way to process two truly large datasets (say 1+ TB each) in Hive is a sort-merge-bucket join (aka SMB join). Read LanguageManual JoinOptimization and watch Join Strategies in Hive.
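The key precondition for an SMB join is that both tables are bucketed and sorted on the join key with the same number of buckets. As a rough sketch of that idea using PySpark's bucketing rather than Hive DDL (table names, the join key and the bucket count are made up, and left_df/right_df stand for the two datasets from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smb_style_join").enableHiveSupport().getOrCreate()

# Write both large datasets bucketed and sorted on the join key.
# Matching bucket counts let the join run without a full shuffle.
left_df.write.bucketBy(64, "record_id").sortBy("record_id").saveAsTable("rules_output")
right_df.write.bucketBy(64, "record_id").sortBy("record_id").saveAsTable("hive_snapshot")

report = spark.table("rules_output").join(spark.table("hive_snapshot"), "record_id")
report.write.mode("overwrite").parquet("/reports/accuracy_check")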