How to write incremental data to Hive using Flink

I use Flink 1.6. I know I can write to Hive with a custom sink and Hive JDBC, or with JDBCAppendTableSink, but both still go through JDBC. The problem is that Hive JDBC does not support batch execution, so I think it will be very slow.
So I looked for another way: I write a DataSet to HDFS with the writeAsText method and then create a Hive table from that HDFS path. But there is still a problem: how do I append incremental data?
The API of FileSystem.WriteMode is:
NO_OVERWRITE: Creates the target file only if no file exists at that path already.
OVERWRITE: Creates a new target file regardless of any existing files or directories.
For example, in the first batch I write September's data to Hive; then I get October's data and want to append it.
But if I use OVERWRITE on the same HDFS file, the September data is gone. If I use NO_OVERWRITE, I must write to a new HDFS file and therefore a new Hive table, yet I need all of the data in the same Hive table, and I do not know how to combine two HDFS files into one Hive table.
So how do I write incremental data to Hive using Flink?

As you already wrote, there is no Hive sink. I guess the default pattern is to write (text, Avro, Parquet) files to HDFS and define an external Hive table on that directory. There it doesn't matter whether you have a single file or multiple files, but you will most likely have to repair the table on a regular basis (msck repair table <db_name>.<table_name>;). This updates the metadata, and the new files become available.
For bigger amounts of data I would recommend partitioning the table and adding the partitions on demand (this blog post might give you a hint: https://resources.zaloni.com/blog/partitioning-in-hive).
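To make that concrete, here is a minimal HiveQL sketch of the pattern, assuming a monthly-partitioned table; the database, table, column names and the /data/events path are placeholders, not taken from the question:

    -- External table over the directory Flink writes to; schema and path are assumptions.
    CREATE EXTERNAL TABLE IF NOT EXISTS my_db.events (
      id      BIGINT,
      payload STRING
    )
    PARTITIONED BY (event_month STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/events';

    -- After Flink writes new files, e.g. under /data/events/event_month=2018-10/,
    -- make them visible to Hive:
    MSCK REPAIR TABLE my_db.events;

    -- or register the new partition explicitly:
    ALTER TABLE my_db.events ADD IF NOT EXISTS PARTITION (event_month='2018-10')
      LOCATION '/data/events/event_month=2018-10';

With such a layout each Flink batch only writes into its own partition directory, so appending October's data never touches the September files.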

Related

How can we delete specific rows from HDFS?

We have a huge number of text files containing information about clients. We have to delete specific rows from these HDFS files; for example, the rows associated with clients X, Y and Z, while keeping the others.
First create a Hive table on top of that HDFS location, then create another table from the first one with the filter logic. Now delete the first Hive table. Make sure the tables are internal (managed), so that dropping the first one also removes its underlying files.
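A hedged HiveQL sketch of that idea; the table and column names (clients_raw, clients_clean, client_id), the delimiter and the path are placeholders:

    -- Internal (managed) table created on top of the existing HDFS location,
    -- so dropping it later also removes the old files.
    CREATE TABLE clients_raw (
      client_id STRING,
      info      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/clients';

    -- Second table keeps everything except clients X, Y and Z.
    CREATE TABLE clients_clean AS
    SELECT * FROM clients_raw
    WHERE client_id NOT IN ('X', 'Y', 'Z');

    -- Dropping the internal table deletes its underlying HDFS files as well.
    DROP TABLE clients_raw;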
The concept of a "row" only makes sense for line-delimited data. For example, if you had Parquet data, or XML files... You want to delete records.
One does not simply "delete records" from HDFS files. HDFS is an append only filesystem.
If the data is already on HDFS, the best you can do is read the files, filter out data you don't want (using whatever tool you want - Pig or Spark would be the easiest IMO), then write a new file, optionally overwriting the old data.
To prevent this from happening, you need an ETL process between the data source and HDFS which sanitizes the data ahead of time.

What are Hive Common Use Cases?

I'm new to Hive; so, I'm not sure how companies use Hive. Let me give you a scenario and see if I'm conceptually correct about the use of Hive.
Let's say my company wants to keep some web server log files and be able to always search through and analyze the logs. So, I create a table whose columns correspond to the columns in the log file. Then I load the log file into the table. Now I can start querying the data. As data comes in at future dates, I just keep adding it to this table, and thus I always have my log files as a table in Hive that I can search through and analyze.
Is that scenario above a common use? And if it is, then how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?
You can use Hive for analysis over static datasets, but if you have streaming logs, I really wouldn't suggest Hive for this. It's not a search engine and will take minutes just to find any reasonable data you're looking for.
HBase would probably be a better alternative if you must stay within the Hadoop ecosystem (Hive can query HBase).
Use Splunk, or the open source alternatives of Solr / Elasticsearch / Graylog if you want reasonable tools for log analysis.
But to answer your questions
how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?
Use an EXTERNAL Hive table over an HDFS location for your logs. Use Flume to send log data to that path (or send your logs to Kafka, and from Kafka to HDFS, as well as a search/analytics system)
You only need to update the table if you're adding date partitions (which you should, because that's how you get faster Hive queries). You'd use MSCK REPAIR TABLE to detect missing partitions on HDFS, or run ALTER TABLE ADD PARTITION yourself on a schedule. Note: Confluent's HDFS Kafka Connect sink will automatically create Hive table partitions for you.
If you must use Hive, you can make the queries faster by converting the data into ORC or Parquet format.
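As a rough illustration of both points, a HiveQL sketch assuming tab-delimited logs under a hypothetical /data/weblogs path, partitioned by date (all names are made up):

    -- External, date-partitioned table over the log directory.
    CREATE EXTERNAL TABLE IF NOT EXISTS weblogs_raw (
      ts      STRING,
      host    STRING,
      request STRING
    )
    PARTITIONED BY (log_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/weblogs';

    -- Pick up partitions written by Flume / Kafka Connect under log_date=... directories:
    MSCK REPAIR TABLE weblogs_raw;
    -- or, on a schedule:
    ALTER TABLE weblogs_raw ADD IF NOT EXISTS PARTITION (log_date='2018-10-01');

    -- Optionally rewrite into ORC for faster queries.
    CREATE TABLE weblogs_orc STORED AS ORC AS
    SELECT * FROM weblogs_raw;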

Connecting HIVE in MRJob

The scenario is that I need to process an input file, and for each record I need to check whether certain fields in the input file match fields stored in a Hadoop cluster.
We are thinking of using MRJob to process the input file and Hive to get the data from the Hadoop cluster. I would like to know whether it is possible to connect to Hive inside an MRJob module. If so, how do I do that?
If not, what would be the ideal approach to fulfill my requirement?
I am new to Hadoop, MRJob and Hive.
Please provide some suggestions.
"matching the fields stored in an Hadoop cluster." --> You mean that you need to search if the fields exists in this file too?
About how many files are there in total which you need to scan?
One solution is to load every single item into an HBase table and, for every record in the input file, GET the record from the table. If the GET succeeds, the record exists elsewhere in HDFS; otherwise it doesn't. You would need a unique identifier for each HBase record, and the same identifier should also exist in your input file.
You could connect to Hive as well, but the schema would need to be rigid in order for all your HDFS files to be loadable into a single Hive table. HBase doesn't really care about columns (only the column families are needed). One more downside of MapReduce plus Hive is that it will be slow compared to HBase, which is near real time.
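If you do go the Hive route, a hedged sketch might look like the following; both schemas and the record_id join key are pure assumptions, since the actual fields aren't given:

    -- Table over the data already in the cluster (schema is an assumption).
    CREATE EXTERNAL TABLE cluster_records (
      record_id STRING,
      fields    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/cluster_records';

    -- Table over the input file to be checked.
    CREATE EXTERNAL TABLE input_records (
      record_id STRING,
      fields    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/input_file';

    -- Flag which input records already exist in the cluster data.
    SELECT i.record_id,
           CASE WHEN c.record_id IS NULL THEN 'missing' ELSE 'exists' END AS status
    FROM input_records i
    LEFT JOIN cluster_records c ON i.record_id = c.record_id;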
Hope this helps.

Incrementally updating data on HDFS

We are copying data from various sources such as Oracle and Teradata to HDFS using Sqoop. We use the incremental update feature to 'import' new data and then 'merge' it with the existing data. The data first gets populated in a temporary directory, and then we 'remove' the old directory and 'rename' the new one.
Problem is, if a user is running a query against the data on HDFS using a tool such as Hive while we swap the directory, the query terminates abnormally.
Is there a better way to handle the updates on HDFS?
(Please note, that even though HBase keeps different versions, it doesn't work for us because we want to query by any column. HBase is very slow in cases where you don't search by primary key.)
Hadoop is not designed to work like that. It is good for storing data, but not for editing it. I would just add the new data beside the old data, and while adding it (copying or any other import) you could add the suffix .tmp to the filename. I did not use Hive that much (Pig user here), but in Pig I could say A = LOAD '/some/path/to/hdfs/*.log' and that would load all files except the .tmp ones that are still being imported. With that there are no problems.

Load Data into Hive from Flat files or existing database

We are setting up Hadoop and Hive in our organization.
We will also have sample data created by a data generator tool. The data will be around 1 TB.
My question is: I have to load that data into Hive and Hadoop. What is the process I need to follow for this?
We will also have HBase installed alongside Hadoop.
We need to recreate in Hive the same database design that currently exists in SQL Server, because after the data is loaded into Hive we want to use Business Objects 4.1 as a front end to create the reports.
The challenge is to load the sample data into Hive.
Please help, as we want to do all of this ASAP.
First, ingest your data into HDFS.
Then use Hive external tables pointing to the location where you ingested the data, i.e. your HDFS directory.
You are all set to query the data from the tables you created in Hive.
Good luck.
For the first case you need to put the data into HDFS:
Transport your data file(s) to a client node (app node).
Put the files into the distributed file system (hdfs dfs -put ...).
Create an external table pointing to the HDFS directory into which you uploaded those files. Your data has to be structured in some way, for instance delimited by a semicolon (see the sketch after these steps).
Now you can operate on the data with SQL queries.
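A minimal sketch of the external-table step; the table name, columns and HDFS path are placeholders for whatever your generated data actually looks like:

    -- External table over the uploaded, semicolon-delimited files.
    CREATE EXTERNAL TABLE sample_data (
      col1 STRING,
      col2 STRING,
      col3 INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    LOCATION '/user/appuser/sample_data';

    -- Query it like any other table.
    SELECT col1, COUNT(*) FROM sample_data GROUP BY col1;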
For the second case you can create another Hive table (using the HBaseStorageHandler, https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration) and load it from the first table with an INSERT statement.
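A hedged sketch of that HBase-backed table, reusing the hypothetical sample_data table from the previous sketch; the HBase table name, column family (cf) and column mapping are assumptions:

    -- Hive table stored in HBase via the HBaseStorageHandler.
    CREATE TABLE sample_data_hbase (
      rowkey STRING,
      col2   STRING,
      col3   INT
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:col2,cf:col3')
    TBLPROPERTIES ('hbase.table.name' = 'sample_data');

    -- Load it from the external table created in the previous step
    -- (col1 is used as the HBase row key here, purely as an example).
    INSERT OVERWRITE TABLE sample_data_hbase
    SELECT col1, col2, col3 FROM sample_data;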
I hope this can help you.
