Can Hive periodically append or insert incremental data to the same table file in hdfs? - hadoop

I'm loading network captured data every minute from Spark streaming (from Flume exec), then aggregate data according to ip address, save to Hive at the end. To make it faster I create Hive ORC table with partition on ip address, it works well. The only issue is every minute it creates many (depends on how many ip addresses) kb small files, now I use "ALTER TABLE...CONCATENATE;" to merge them manually, but I think it could be easier, so want to ask whether there is solution that can incrementally merge/append new data to first minute table files instead of creating new table files every minute. Any suggestion is appreciated!

I give up, looks no direct solution as Hive can't append content to existing datafile for performance consideration. Now my alternative is still to concatenate it every week, the problem is query will be broken with error message (complaining it can't find data file) when it's doing concatenation, so there is big business impact. Now I'm thinking replacing Hive with HBase or Kudu which is more flexible and can provide update/delete operation.

Related

HDFS File Compaction with continuous ingestion

We have few tables in HDFS which are getting approx. 40k new files per day. We need to compact these tables every two weeks and for that we need to stop ingestion.
We have spark ingestion getting data from kafka and adding to HDFS (Hive external tables) every 30 mins. The data is queried as soon as it is ingested, our SLA is less than an hour so we can not increase the batch interval.
The tables are partition on two fields, we get older data constantly so most of the partitions are updated during each injection batch
eg:
/user/head/warehouse/main_table/state=CA/store=macys/part-00000-017258f8-aaa-bbb-ccc-wefdsds.c000.snappy.parquet
We are looking into ways to reduce number of file creations but even with that we will have to do compaction every 3/4 weeks if not two.
As most of the partitions are updated constantly, we need to stop the injection (~ 1 day) before starting compaction which is impacting our users.
I am looking for ways to compact automatically with out stopping the ingestion?
The chosen partitioning scheme is somewhat unfortunate. Still there are a couple of things you can do. I'm relying on the fact that you can change partition's location atomically in Hive (alter table ... partition ... set location):
Copy a partition's hdfs directory to a different location
Compact copied data
Copy new files that were ingested since step 1
do "alter table ... partition ... set location" to point Hive to a new compacted location.
Start ingesting to this new location (in case if this step is tricky you can just as well replace the "small" files in the original partition location with their compacted version and do "alter table ... partition ... set location" again to point Hive back to the original partition location.
You'll have to keep this process running iterating partition-by-partition on a continuous basis.
Thank you Facha for your suggestions, really appreciate it.
I am pretty new to HDFS concept so please dont mind basic questions,
What would be the impact on running queries which are accessing these specific files while doing swapping of uncompacted files with compacted files (alter table ... partition ... set location). I believe that the queries might fail. Who can we minimize the impact?
Copy a partition's hdfs directory to a different location
As we have two partitions in one table, state and store, will I have to iterate through each sub partition?
/tableName/state=CA/store=macys/file1.parquet
/tableName/state=CA/store=macys/file2.parquet
/tableName/state=CA/store=JCP/file2.parquet
/tableName/state=CA/store=JCP/file2.parquet
/tableName/state=NY/store=macys/file1.parquet
/tableName/state=NY/store=macys/file2.parquet
/tableName/state=NY/store=JCP/file2.parquet
/tableName/state=NY/store=JCP/file2.parquet
For each state
for each store
get list of files in this dir to replace later
compact
/tableName/state=$STATE/store=$STORE (SPARK JOb?)
replace uncompacted files with compacted files
alter table ... partition ... set location
I would prefer your other suggestion in step 5 " just as well replace the "small" files in the original partition location with their compacted version"
How would I go ahead with implementing it, will it be best done with scripting or scala or some other programing language. I have basic knowledge of scripting, good experiencs in java and new to scala but can learn in couple of days.
Regards,
P

Spark read.parquet takes too much time

Hi I don't understand why this code takes too much time.
val newDataDF = sqlContext.read.parquet("hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*")
It's supposed than no bytes are transferred to the driver program, isn't it? How does read.parquet works?
What I can see from the Spark web UI is that read.spark fires about 4000 tasks (there's a lot of parquet files inside that folder).
The issue most likely is the file indexing that has to occur as the first step of loading a DataFrame. You said the spark.read.parquet fires off 4000 tasks, so you probably have many partition folders? Spark will get an HDFS directory listing and recursively get the FileStatus (size and splits) of all files in each folder. For efficiency Spark indexes the files in parallel, so you want to ensure you have enough cores to make it as fast as possible. You can also be more explicit in the folders you wish to read or define a Parquet DataSource table over the data to avoid the partition discovery each time you load it.
spark.sql("""
create table mydata
using parquet
options(
path 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*'
)
""")
spark.sql("msck repair table mydata")
From this point on, when you query the data it will no longer have to do the partition discovery, but it'll still have to get the FileStatus for the files within the folders you query. If you add new partitions you can either add the partition explicitly of force a full repair table again:
spark.sql("""
alter table mydata add partition(foo='bar')
location 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711/foo=bar'
""")

What are Hive Common Use Cases?

I'm new to Hive; so, I'm not sure how companies use Hive. Let me give you a scenario and see if I'm conceptually correct about the use of Hive.
Let's say my company wants to keep some web server log files and be able to always search through and analyze the logs. So, I create a table columns of which correspond to the columns in the log file. Then I load the log file into the table. Now, I can start query the data. So, as the data comes in at future dates, I just keep adding the data to this table, and thus I always have my log files as a table in Hive that I can search through and analyze.
Is that scenario above a common use? And if it is, then how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?
You can use Hive, for analysis over static datasets, but if you have streaming logs, I really wouldn't suggest Hive for this. It's not a search engine and will take minutes just to find any reasonable data you're looking for.
HBase would probably be a better alternative if you must stay within the Hadoop ecosystem. (Hive can query Hbase)
Use Splunk, or the open source alternatives of Solr / Elasticsearch / Graylog if you want reasonable tools for log analysis.
But to answer your questions
how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?
Use an EXTERNAL Hive table over an HDFS location for your logs. Use Flume to send log data to that path (or send your logs to Kafka, and from Kafka to HDFS, as well as a search/analytics system)
You only need to update the table if you're adding date partitions (which you should because that's how you get faster Hive queries). You'd use MSCK REPAIR TABLE to detect missing partitions on HDFS. Or run ALTER TABLE ADD PARTITION yourself on a schedule. Note: Confluent's HDFS Kafka Connect will automatically create Hive table partitions for you
If you must use Hive, you can improve the queries better if you convert the data into ORC or Parquet format

How to do complex query on big data?

every one.
I have some data about 6G in hdfs that has been exported from mysql.And I have write mapreduces prehandling data to fill some key field that data can be easily queried.
As the business demands are different aggregation data group by day ,hour,hospital,area etc,
so I have to write many hive sqls exporting data to local disk,and then I write python script to parse files on local disk ,then get datas in demand.
Is there some good technique on hadoop to resolve my demand.I am considering.
Can you help me ,please.

How to keep a flat file on HDFS in sync with a large database table?

What's the best way of keeping a flat file on HDFS in sync with a large database table which may have row updates?
Tools such as sqoop seem like they'd be useful as they allow incremental extracts of new rows from tables, however I can't see an easy way of handling row updates.
What techniques can we use to handle row updates in an efficent manner? Dumping entire tables nightly is something we'd rather avoid.
Here are a couple suggestions:
Use DBInputFormat to make the database the input to your jobs, instead of having an intermediate file that you have to worry about synchronizing. If MySQL becomes a bottleneck, you can use some distributed/NoSQL database.
If you still want to use flat files, each night you can only dump the rows that changed in MySQL, along with a timestamp. Write a Hadoop job that outputs only the most recent version of each unique row.
I prefer having an updated_at field in mysql table to only get the changed data every night. After that I do a simple map reduce to apply changes on (merge with) old state.

Resources