We have a problem with storing data in HBase. We have taken the following steps:
A big CSV file (size: 20 GB) is processed by a Spark application, which produces HFiles as a result (result data size: 180 GB).
The table is created with the command: create 'TABLE_NAME', {'NAME'=>'cf', 'COMPRESSION'=>'SNAPPY'}
Data from the created HFiles is bulk-loaded with the command: hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=1024 hdfs://ip:8020/path TABLE_NAME
Right after loading, the table size is 180 GB; however, after some period of time (yesterday it was at 8 pm, two days ago around 8 am) a process is launched which compacts the data down to 14 GB.
My question is: what is the name of this process? Is it a major compaction? I ask because I'm trying to trigger compaction (major_compact and compact) manually, but this is the output of the command run on the uncompacted table:
hbase(main):001:0> major_compact 'TEST_TYMEK_CRM_ACTION_HISTORY'
0 row(s) in 1.5120 seconds
This is the compaction process. I can suggest the following reason for such a big difference in table size. When the HFiles are produced by the Spark application, no compression codec is applied to them, because compression is specified on the table after the files have already been created. Attaching the HFiles to the table does not change their format (all files in HDFS are immutable). Only after the compaction process will the data be compressed. You can monitor the compaction process via the HBase UI; it usually runs on port 60000.
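Note that the shell's major_compact only schedules the compaction and returns immediately, which is why the table does not shrink right after the command. Below is a minimal sketch (assuming the HBase 2.x client API; it is not from the original post) that triggers a major compaction programmatically and waits for it to finish:
// Sketch only: trigger a major compaction and poll until the region servers report it is done.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{CompactionState, ConnectionFactory}

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())  // reads hbase-site.xml
val admin = connection.getAdmin
val table = TableName.valueOf("TEST_TYMEK_CRM_ACTION_HISTORY")

admin.majorCompact(table)                                       // asynchronous: returns once the request is queued
while (admin.getCompactionState(table) != CompactionState.NONE)
  Thread.sleep(10000)                                           // poll until the compaction has finished

admin.close()
connection.close()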
I am very new to Spark and recently inherited an application which does the following:
Gets data from a data source (RDBMS, Excel, CSV, etc.) in batches.
Creates a Dataset using the data from those sources.
Writes it to Parquet files.
These 3 steps happen sequentially: read 100k, create a DataFrame (1M), write to Parquet.
Currently the system is running on a single box where we also have Spark running. We have a Spark cluster (4 workers) that we are planning to use instead.
It currently takes us a very long time to write the rows of data to disk (1M records take about 6 minutes). Given that we are moving to the Spark cluster, how can I optimize this process and improve the performance?
Here is the code:
// create a Dataset of rows of data
Dataset<Row> ds = spark.createDataFrame(rows, schema).coalesce(1);
// write to disk
ds.write().parquet(curFile.toString());
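One thing that stands out in the snippet above is coalesce(1): it forces the whole Dataset into a single partition, so one task writes all of the Parquet data no matter how many workers the cluster has. Below is a minimal sketch of the same write with several partitions kept, so the cluster can write in parallel (in Scala rather than the original Java, with a placeholder schema, path and a purely illustrative partition count of 16, and assuming a SparkSession named spark is in scope as in spark-shell):
// Scala sketch with placeholder data: keep several partitions so multiple tasks write Parquet files in parallel.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Placeholder schema and rows standing in for the data built from each batch in the original code.
val schema = StructType(Seq(StructField("id", IntegerType), StructField("value", StringType)))
val rows = spark.sparkContext.parallelize((1 to 1000).map(i => Row(i, s"value-$i")))

spark.createDataFrame(rows, schema)
  .repartition(16)                    // instead of coalesce(1): several write tasks instead of one
  .write
  .mode("append")
  .parquet("/tmp/parquet-output")     // illustrative path standing in for curFile.toString()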
I have a simple Spark job that streams data to a Delta table.
The table is pretty small and is not partitioned.
A lot of small parquet files are created.
As recommended in the documentation (https://docs.delta.io/1.0.0/best-practices.html) I added a compaction job that runs once a day.
val path = "..."
val numFiles = 16
spark.read
.format("delta")
.load(path)
.repartition(numFiles)
.write
.option("dataChange", "false")
.format("delta")
.mode("overwrite")
.save(path)
Every time the compaction job runs the streaming job gets the following exception:
org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.
I tried to add the following config parameters to the streaming job:
spark.databricks.delta.retryWriteConflict.enabled = true # would be false by default
spark.databricks.delta.retryWriteConflict.limit = 3 # optionally limit the maximum amount of retries
It doesn't help.
Any idea how to solve the problem?
When you're streaming the data in, small files are being created (additive) and these files are referenced in your delta log (an update). When you perform your compaction, you're trying to resolve the small-files overhead by collating the data into larger files (currently 16). These large files are created alongside the small ones, but the change takes effect when the delta log is written. That is, transactions 0-100 create 100 small files, the compaction runs, and your new transaction tells the log to now refer to the 16 large files instead. The problem is that transactions 101-110 have already happened from the streaming job while the compaction was running. After all, you're compacting ALL of your data, so you essentially have a merge conflict.
The solution is to go to the next step in the best practices and only compact selected partitions using:
.option("replaceWhere", partition)
When you compact every day, the partition variable should represent the partition of your data for yesterday. No new files are being written to that partition, and the delta log can identify that the concurrent changes will not apply to currently incoming data for today.
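A sketch of what that partition-scoped compaction could look like, assuming the table is partitioned by a date column (the column name and the literal below are placeholders, not taken from the original job, and a SparkSession named spark is in scope):
// Sketch only: compact yesterday's partition instead of the whole table.
// Assumes the Delta table is partitioned by a `date` column; the original table was not partitioned.
val path = "..."
val partition = "date = '2021-11-07'"         // yesterday's partition: no longer receiving new files
val numFiles = 16

spark.read
  .format("delta")
  .load(path)
  .where(partition)                           // read only the partition being rewritten
  .repartition(numFiles)
  .write
  .option("dataChange", "false")
  .option("replaceWhere", partition)          // overwrite only that partition in the delta log
  .format("delta")
  .mode("overwrite")
  .save(path)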
We have a Hadoop-based solution (CDH 5.15) where we are getting new files in HDFS in some directories. On top of those directories we have 4-5 Impala (2.1) tables. The process writing those files to HDFS is Spark Structured Streaming (2.3.1).
Right now, we are running some DDL queries as soon as we get the files written to HDFS:
ALTER TABLE table1 RECOVER PARTITIONS to detect new partitions (and their HDFS directories and files) added to the table.
REFRESH table1 PARTITION (partition1=X, partition2=Y), using all the keys for each partition.
Right now, these DDLs are taking a bit too long and they are getting queued in our system, hurting the data availability of the system.
So, my question is: Is there a way to do this data incorporation more efficiently?
We have considered:
Using ALTER TABLE .. RECOVER PARTITIONS, but as per the documentation, it only refreshes new partitions.
Tried to use REFRESH .. PARTITION ... with multiple partitions at once, but the statement syntax does not allow that.
Tried batching the queries, but the Hive JDBC driver does not support batching queries.
Shall we try to do those updates in parallel given that the system is already busy?
Any other way you are aware of?
Thanks!
Victor
Note: The way we know which partitions need to be refreshed is by using HDFS events, as with Spark Structured Streaming we don't know exactly when the files are written.
Note #2: Also, the files written to HDFS are sometimes small, so it would be great if it were possible to merge those files at the same time.
Since nobody seems to have an answer for my problem, I would like to share the approach we took to make this processing more efficient. Comments are very welcome.
We discovered (the documentation is not very clear on this) that part of the information stored in the Spark "checkpoints" in HDFS is a set of metadata files describing when each Parquet file was written and how big it was:
$hdfs dfs -ls -h hdfs://...../my_spark_job/_spark_metadata
-rw-r--r-- 3 hdfs 68K 2020-02-26 20:49 hdfs://...../my_spark_job/_spark_metadata/3248
-rw-r--r-- 3 hdfs 33.3M 2020-02-26 20:53 hdfs://...../my_spark_job/_spark_metadata/3249.compact
-rw-r--r-- 3 hdfs 68K 2020-02-26 20:54 hdfs://...../my_spark_job/_spark_metadata/3250
...
$hdfs dfs -cat hdfs://...../my_spark_job/_spark_metadata/3250
v1
{"path":"hdfs://.../my_spark_job/../part-00004.c000.snappy.parquet","size":9866555,"isDir":false,"modificationTime":1582750862638,"blockReplication":3,"blockSize":134217728,"action":"add"}
{"path":"hdfs://.../my_spark_job/../part-00004.c001.snappy.parquet","size":526513,"isDir":false,"modificationTime":1582750862834,"blockReplication":3,"blockSize":134217728,"action":"add"}
...
So, what we did was:
Build a Spark Streaming job that polls that _spark_metadata folder (a condensed sketch follows this list).
We use a fileStream, since it allows us to define the file filter to use.
Each entry in that stream is one of those JSON lines, which is parsed to extract the file path and size.
Group the files by the parent folder they belong to (each folder maps to an Impala partition).
For each folder:
Read a dataframe loading only the targeted Parquet files (to avoid race conditions with the other job writing the files)
Calculate how many blocks to write (using the size field in the JSON and a target block size)
Coalesce the DataFrame to the desired number of partitions and write it back to HDFS.
Execute the DDL REFRESH myTable PARTITION ([partition keys derived from the new folder]).
Finally, delete the source files
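A condensed sketch of that job follows. The batch interval, HDFS paths, JDBC endpoint, table name and the key=value partition-folder layout are illustrative assumptions, not the exact production code:
// Condensed, illustrative sketch of the metadata-driven compaction job.
import java.sql.DriverManager
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MetadataCompactor {

  case class FileEntry(path: String, size: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("metadata-compactor").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(300))   // batch time is configurable
    val targetBlockSize = 128L * 1024 * 1024                           // configurable target size per output file

    // fileStream lets us pass a filter: pick up the plain batch files and skip the .compact ones.
    val metadataFilter = (p: Path) => !p.getName.endsWith(".compact")
    val stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs://...../my_spark_job/_spark_metadata", metadataFilter, newFilesOnly = true)

    stream.map(_._2.toString).foreachRDD { rdd =>
      // Each useful line is a JSON entry; extract the "path" and "size" fields (the "v1" header yields no match).
      val entryPattern = """"path":"([^"]+)".*?"size":(\d+)""".r
      val files = rdd.collect()
        .flatMap(entryPattern.findFirstMatchIn(_))
        .map(m => FileEntry(m.group(1), m.group(2).toLong))

      // One group per parent folder, i.e. per Impala partition.
      files.groupBy(f => new Path(f.path).getParent).foreach { case (folder, group) =>
        val numOutputFiles = math.max(1, (group.map(_.size).sum / targetBlockSize).toInt)

        spark.read.parquet(group.map(_.path): _*)          // load only the files named in the metadata
          .coalesce(numOutputFiles)
          .write.mode("append")
          .parquet(folder.toString + "_compacted")         // illustrative target; sources are deleted afterwards

        // One REFRESH per partition; assumes string-typed key=value folder names, e.g. .../dt=2020-02-26.
        val spec = folder.getName.split('=') match { case Array(k, v) => s"$k='$v'" }
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection("jdbc:hive2://impalad-host:21050/;auth=noSasl")
        try conn.createStatement().execute(s"REFRESH myTable PARTITION ($spec)")
        finally conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}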
What we achieved is:
Limit the DDLs by doing one refresh per partition and batch.
By having batch time and block size configurable, we are able to adapt our product to different deployment scenarios with bigger or smaller datasets.
The solution is quite flexible, since we can assign more or less resources to the Spark Streaming job (executors, cores, memory, etc.) and also we can start/stop it (using its own checkpointing system).
We are also studying the possibility of applying some data repartitioning while doing this process, to get partitions as close as possible to the optimum size.
I'm facing an issue with ORC data in Hive. I need some suggestions, in case someone has faced a similar problem.
I have huge data stored in a Hive table (partitioned & stored as ORC). The ORC data size is around 4 TB. I'm trying to copy this data to an uncompressed, plain Hive table (same table structure).
The process runs forever and occupies a huge amount of non-DFS storage along the way. At present the process has been running for 12 hours and has occupied 130 TB of non-DFS storage. That is very abnormal for a Hadoop cluster with 20 servers.
Below are my parameters:
Hadoop running: HDP 2.4
Hive: 0.13
No. of servers: 20 (2 NameNodes included)
I wonder what a simple join or a normal analytics operation on this ORC table would do. And in theory, ORC-format data should increase performance for basic DML queries.
Can someone please let me know if I'm doing something wrong, or if this is normal behavior? This is my first experience with ORC data.
Well, for starters, I saw that the YARN log files are being created in huge sizes, and they mostly contain error logs.
Thanks
So I'm developing an application that is expected to deal with large amounts of data, and as such I've decided to use Hadoop to process it.
My services node and data nodes are separated from the webapp, so I'm using HttpFS to let the app communicate with Hadoop.
So, whenever a new row of data is generated in my application, should I immediately call the corresponding HttpFS URL to append the data to an HDFS file? Or should I write this data to a file on the web server and use a cron job to upload it to HDFS, for example every hour?
Should I have the Hive table updated or should I just load the data in there whenever I need to query it?
I'm pretty new to Hadoop so any link that could help will also be useful.
I prefer the approach below.
Do not call the HttpFS URL to append data to an HDFS file for every row update. HDFS is efficient when the data file size is more than 128 MB (in Hadoop 2.x) or 64 MB (in Hadoop 1.x).
Write the data on the web server. Use a rolling appender that rolls when the file size reaches a certain limit, in multiples of 128 MB, e.g. a 1 GB file (a rough sketch follows at the end of this answer).
You can have hourly cron jobs, but make sure that you are sending a big data file (e.g. 1 GB or a multiple of 128 MB) instead of just sending whatever log file has accumulated in 1 hour.
Regarding loading of the data, you can use internal or external Hive tables. Have a look at this article.
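Here is a rough sketch of the rolling-appender idea mentioned above (class name, directory layout and the 1 GB threshold are illustrative, and error handling is omitted): rows are appended to a local file, and once the file crosses the threshold it is copied into HDFS in one shot (here via the Hadoop FileSystem API; the same idea applies to an HttpFS upload) instead of one call per row.
// Illustrative sketch: append rows locally, push a file to HDFS only once it is big enough.
import java.io.{File, FileWriter}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

class RollingHdfsUploader(localDir: String, hdfsDir: String,
                          rollBytes: Long = 1024L * 1024 * 1024) {    // ~1 GB, i.e. 8 x 128 MB blocks

  private val fs = FileSystem.get(new Configuration())                // picks up core-site.xml / hdfs-site.xml
  private var current = newLocalFile()

  private def newLocalFile(): File = new File(localDir, s"events-${System.currentTimeMillis()}.log")

  // Called by the webapp for every generated row: a cheap local append, no HDFS round trip.
  def append(row: String): Unit = synchronized {
    val writer = new FileWriter(current, true)
    try writer.write(row + "\n") finally writer.close()
    if (current.length() >= rollBytes) roll()
  }

  // Ship the full file to HDFS as one large write and start a new local file.
  def roll(): Unit = synchronized {
    fs.copyFromLocalFile(new Path(current.getAbsolutePath), new Path(hdfsDir, current.getName))
    current = newLocalFile()
  }
}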