Incrementally updating data on HDFS - hadoop

We are copying data from various sources such as Oracle and Teradata to HDFS using Sqoop. We use the incremental update feature to 'import' new data and then 'merge' it with the existing data. The data first gets populated in a temporary directory, and then we 'remove' the old directory and 'rename' the new one.
The problem is that if a user is running a query against the data on HDFS with a tool such as Hive while we swap the directory, the query terminates abnormally.
Is there a better way to handle the updates on HDFS?
(Please note that even though HBase keeps different versions, it doesn't work for us because we want to be able to query by any column. HBase is very slow in cases where you don't search by the primary key.)

Hadoop is not designed to work like that. It is good for storing data, but not for editing it in place. I would just add the new data beside the old data, and while adding it (copying or any other import) you could add a .tmp suffix to the file names. I did not use Hive that much (Pig user here), but in Pig I could write A = LOAD '/some/path/to/hdfs/*.log' and that would load all the files except the .tmp ones that are still being imported. With that, there are no problems.
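If the readers are Hive users, the same "add new files beside the old ones, never swap the directory" idea can be sketched with an external table over a fixed location; all names below are made up for illustration:

-- External table over a fixed directory; the directory itself is never removed or renamed.
CREATE EXTERNAL TABLE orders_current (
  order_id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/orders/current';

-- Each incremental import just adds new files next to the existing ones;
-- the next Hive query picks up whatever files are in the directory at that moment.
SELECT COUNT(*) FROM orders_current;

Half-written files are still a concern, which is what the .tmp suffix trick above is meant to avoid.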

Related

How to write incremental data to Hive using Flink

I use Flink 1.6. I know I can use a custom sink and Hive JDBC to write to Hive, or use JDBCAppendTableSink, but it still goes through JDBC. The problem is that Hive JDBC does not support the batchExecute method, so I think it will be very slow.
Then I looked for another way: I write a DataSet to HDFS with the writeAsText method and then create a Hive table from HDFS. But there is still a problem: how to append the incremental data.
The API of FileSystem.WriteMode is:
NO_OVERWRITE -- creates the target file only if no file exists at that path already.
OVERWRITE -- creates a new target file regardless of any existing files or directories.
For example, in the first batch I write September's data to Hive; then I get October's data and I want to append it.
But if I use OVERWRITE on the same HDFS file, September's data will not exist any more; if I use NO_OVERWRITE, I must write to a new HDFS file and then a new Hive table, while we need them in the same Hive table. And I do not know how to combine two HDFS files into one Hive table.
So how do I write incremental data to Hive using Flink?
As you already wrote, there is no Hive sink. I guess the default pattern is to write (text, Avro, Parquet) files to HDFS and define an external Hive table on that directory. There it doesn't matter whether you have a single file or multiple files. But you most likely have to repair this table on a regular basis (msck repair table <db_name>.<table_name>;). This will update the metadata, and the new files will be available.
For bigger amounts of data I would recommend partitioning the table and adding the partitions on demand (this blog post might give you a hint: https://resources.zaloni.com/blog/partitioning-in-hive).
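A minimal sketch of that pattern (the table, columns and paths are made up here): Flink writes each month's files under its own partition directory, and Hive is pointed at the parent directory.

-- External table over the directory Flink writes to, partitioned by month.
CREATE EXTERNAL TABLE events (
  user_id BIGINT,
  payload STRING
)
PARTITIONED BY (month STRING)
STORED AS TEXTFILE
LOCATION '/data/events';

-- After Flink has written /data/events/month=2018-10/..., register the new files:
MSCK REPAIR TABLE events;

-- MSCK only discovers directories that follow the key=value naming convention;
-- otherwise add the partition explicitly:
ALTER TABLE events ADD IF NOT EXISTS PARTITION (month='2018-10')
  LOCATION '/data/events/month=2018-10';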

How can we delete specific rows from HDFS?

We have a huge number of text files containing information about clients. We have to delete specific rows from these HDFS files; for example, the rows associated with clients X, Y and Z, while keeping the others.
First create a Hive table on top of that HDFS location, then create another one from the first table with the filter logic. Now delete the first Hive table. Make sure the tables are internal (managed), so that dropping the first table also removes the original files.
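A sketch of that approach in HiveQL, with hypothetical table, column and path names, and assuming your Hive version lets you create a managed table with an explicit LOCATION:

-- Managed (internal) table on top of the existing files.
CREATE TABLE clients_raw (client_id STRING, info STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/clients';

-- New table containing everything except the unwanted clients.
CREATE TABLE clients_clean AS
SELECT * FROM clients_raw
WHERE client_id NOT IN ('X', 'Y', 'Z');

-- Dropping the internal table also removes the original files under /data/clients.
DROP TABLE clients_raw;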
The concept of a "row" only makes sense for line-delimited data. For example, if you had Parquet data, or XML files... You want to delete records.
One does not simply "delete records" from HDFS files. HDFS is an append only filesystem.
If the data is already on HDFS, the best you can do is read the files, filter out data you don't want (using whatever tool you want - Pig or Spark would be the easiest IMO), then write a new file, optionally overwriting the old data.
To prevent this from happening, you need an ETL process between the data source and HDFS which sanitizes the data ahead of time.

Querying data from har archives - Apache Hive

I am using Hadoop and facing the dreaded problem of large numbers of small files. I need to be able to create har archives out of existing hive partitions and query them at the same time. However, Hive apparently supports archiving partitions only in managed tables and not external tables - which is pretty sad. I am trying to find a workaround for this, by manually archiving the files inside a partition's directory, using hadoop's archive tool. I now need to configure hive to be able to query the data stored in these archives, along with the unarchived data stored in other partition directories. Please note that we only have external tables in use.
The namespace for accessing the files in the created partition-har corresponds to the hdfs path of the partition dir.
For example, a file in HDFS:
hdfs:///user/user1/data/db1/tab1/ds=2016_01_01/f1.txt
can after archiving be accessed as:
har:///user/user1/data/db1/tab1/ds=2016_01_01.har/f1.txt
Would it be possible for Hive to query the HAR archives from the external table? Please suggest a way if so.
Best Regards
In practice, the line between "managed" and "external" tables is very thin.
My suggestion:
create a "managed" table
explicitly add partitions for some days in the future, but with ad hoc locations -- i.e. the directories your external process expects to use
let the external process dump its files directly at the HDFS level -- they are automagically exposed in Hive queries, "managed" or not (the Metastore does not track individual files and blocks; they are detected on each query; as a side note, you can run backup & restore operations at the HDFS level if you wish, as long as you don't mess with the directory structure)
when a partition is "cold" and you are pretty sure there will never be another file dumped there, you can run a Hive command to archive the partition, i.e. move the small files into a single HAR and flag the partition as "archived" in the Metastore (see the sketch after this answer)
Bonus: it's easy to unarchive your partition within Hive (whereas there is no hadoop unarchive command AFAIK).
Caveat: it's a "managed" table so remember not to DROP anything unless you have safely moved your data out of the Hive-managed directories.

HDFS and HBase: how does it work?

Hi everybody,
I'm quite new to big data. I have installed an HDFS + HBase test setup and I use Talend Big Data (an ETL tool) for my tests.
I would like to know: if I put a file directly into HDFS, without going through HBase, can I never query that data? I mean, do I have to read the entire file if I want to filter out the data I want, is that right?
Thanks a lot for any help!
HDFS is just a distributed file system; you cannot query your files without going through an intermediate component.
HBase is a NoSQL database that persists your data on HDFS; use it when you need random access to your data.
If you want to store your files on HDFS as they are and query them, you can create an external table over them using Hive.
The best option is to use Hive on top of the files that are on HDFS. You can use bucketing and partitioning in Hive for performance improvement.
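A small sketch of such a table, with invented names; partition pruning lets queries read only the relevant directories, and bucketing only pays off if the data is written through Hive so the bucket layout is respected:

-- External table over existing files, partitioned by date and bucketed by user.
CREATE EXTERNAL TABLE web_logs (
  user_id STRING,
  url STRING,
  status INT
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';

-- Register the partition directories that already exist on HDFS.
MSCK REPAIR TABLE web_logs;

-- Only the dt=2018-10-01 directory is read, not the whole dataset.
SELECT url, COUNT(*) FROM web_logs WHERE dt = '2018-10-01' GROUP BY url;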

Load Data into Hive from Flat files or existing database

We are setting up Hadoop and Hive in our organization.
We will also have sample data created by a data generator tool. The data will be around 1 TB.
My question is: I have to load that data into Hive and Hadoop. What is the process I need to follow for this?
We will also have HBase installed with Hadoop.
We need to recreate in Hive the same database design that currently exists in SQL Server, because after this data is loaded into Hive we want to use Business Objects 4.1 as a front end to create the reports.
The challenge is to load the sample data into Hive.
Please help me, as we want to get all of this done as soon as possible.
First, ingest your data into HDFS.
Use Hive external tables pointing to the location where you ingested the data, i.e. your HDFS directory.
You are all set to query the data from the tables you created in Hive.
Good luck.
For the first case you need to put the data in HDFS:
Transport your data file(s) to a client node (app node).
Put your files into the distributed file system (hdfs dfs -put ...).
Create an external table pointing to the HDFS directory into which you uploaded those files. Your data must be structured in some way, for instance delimited by a semicolon.
Now you can operate over the data with SQL queries.
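For example (the table name, columns and path below are hypothetical), a semicolon-delimited file uploaded to HDFS could be exposed like this:

-- Files were uploaded with something like: hdfs dfs -put clients.csv /data/landing/clients/
CREATE EXTERNAL TABLE clients_ext (
  id INT,
  name STRING,
  city STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION '/data/landing/clients';

-- The files are now queryable with plain SQL.
SELECT name, city FROM clients_ext WHERE city = 'Madrid';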
For the second case you can create another Hive table (using the HBaseStorageHandler, https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration) and load it from the first table with an INSERT statement.
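A sketch of that second case, following the HBaseIntegration wiki page linked above; the table names and column mapping are invented for illustration:

-- Hive table backed by HBase via the HBaseStorageHandler.
CREATE TABLE clients_hbase (key INT, name STRING, city STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:city')
TBLPROPERTIES ('hbase.table.name' = 'clients');

-- Load it from the external table created in the first case.
INSERT OVERWRITE TABLE clients_hbase
SELECT id, name, city FROM clients_ext;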
I hope this can help you.
