We have a huge number of text files containing information about clients. We have to delete specific rows from these HDFS files; for example, delete the rows associated with clients X, Y and Z while keeping the others.
First, create a Hive table on top of that HDFS location, then create a second table from the first one using your filter logic. Now drop the first Hive table. Make sure the tables are internal (managed), so that dropping the first table also removes its underlying files.
The concept of a "row" only makes sense for line-delimited data. If you had Parquet data or XML files, for example, what you would want to delete is records.
One does not simply "delete records" from HDFS files. HDFS is an append-only filesystem.
If the data is already on HDFS, the best you can do is read the files, filter out data you don't want (using whatever tool you want - Pig or Spark would be the easiest IMO), then write a new file, optionally overwriting the old data.
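A minimal sketch of that read, filter, rewrite cycle, written in plain Python against a local file standing in for an HDFS file. On a real cluster the reading and writing would be done by Spark or Pig; the function name, the assumption that the client id is the first delimited field, and the delimiter are all hypothetical.

```python
import os
import tempfile


def rewrite_without_clients(path, excluded_clients, delimiter=","):
    """Rewrite `path`, dropping lines whose first field is an excluded client.

    Assumption: the client id is the first delimited field of every line.
    Returns the number of lines kept.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write the filtered copy next to the original so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    kept = 0
    with os.fdopen(fd, "w", encoding="utf-8") as out, \
            open(path, encoding="utf-8") as src:
        for line in src:
            client_id = line.split(delimiter, 1)[0].strip()
            if client_id not in excluded_clients:
                out.write(line)
                kept += 1
    # Overwrite the old data only once the new file is complete.
    os.replace(tmp_path, path)
    return kept
```

The old file is never edited in place: a complete new file is produced first and only then swapped in, which mirrors the "write a new file, optionally overwriting the old data" step.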
To prevent unwanted data from reaching HDFS in the first place, you need an ETL process between the data source and HDFS which sanitizes the data ahead of time.
My requirement is to read that and generate another set of Parquet data in another ADLS folder.
Do I need to load this into Spark dataframes and perform upserts?
Parquet is like any other file format. You have to overwrite the files to perform inserts, updates and deletes. It does not have ACID properties like a database.
1 - We can use set operators (EXCEPT, INTERSECT, UNION) with the Spark dataframe to accomplish what you want. However, they compare at both the row and column level. This is not as nice as ANSI SQL.
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-setops.html
2 - We can save the data in the target directory as a DELTA file. Most people are using DELTA since it has ACID properties like a database. Please see the merge statement. It allows for updates and inserts.
https://docs.delta.io/latest/delta-update.html
Additionally, we can soft delete when reversing the match.
The nice thing about a delta file (table) is that we can partition by date for a daily file load. Thus we can use time travel to see what happened yesterday versus today.
https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
3 - If you do not care about history and soft deletes, the easiest way to accomplish this task is to archive the old files in the target directory, then copy over the new files from the source directory to the target directory.
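To make the upsert semantics of option 2 concrete, here is a small sketch in plain Python over lists of dictionaries. It is not the Delta Lake API; it only illustrates what a merge does logically. The `is_deleted` flag, the `id` key and the reading of "soft delete when reversing the match" as "flag target rows that have no match in the source" are assumptions.

```python
def merge_upsert(target, source, key="id"):
    """Return a new record list after merging `source` into `target`.

    - key present in both: the source row replaces the target row (update)
    - key only in source: the row is appended (insert)
    - key only in target: the row is kept but flagged (soft delete)
    """
    source_by_key = {row[key]: row for row in source}
    merged = []
    for row in target:
        if row[key] in source_by_key:
            # Matched: take the newer values from the source.
            merged.append({**source_by_key.pop(row[key]), "is_deleted": False})
        else:
            # Not matched by source: keep the row, mark it as deleted.
            merged.append({**row, "is_deleted": True})
    # Whatever is left in the source was not in the target: insert it.
    for row in source_by_key.values():
        merged.append({**row, "is_deleted": False})
    return merged
```

With plain Parquet, the merged result would have to be written out as a complete new set of files; a Delta table performs the equivalent bookkeeping transactionally.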
I want to configure a Flume flow so that it takes in a CSV file as a source, checks the data, and dynamically separates each row of data into folders by year/month in HDFS. Is this possible?
I might suggest you look at using NiFi instead. I feel like it's the natural replacement for Flume.
Having said that, you might want to consider using the spooling directory source and a Hive sink (instead of HDFS). The Hive partitions (on year/month) would enable you to land the data in the manner you are suggesting.
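The routing logic itself is simple, whichever tool ends up running it. The sketch below shows it in plain Python on a local directory, using Hive-style `year=YYYY/month=MM` folder names. It is not a Flume or NiFi configuration; the column name `date`, the date format and the output file name are assumptions.

```python
import csv
import os
from datetime import datetime


def split_by_year_month(csv_path, out_root, date_column="date",
                        date_format="%Y-%m-%d"):
    """Route each CSV row into out_root/year=YYYY/month=MM/part-00000.csv.

    Rows whose date is missing or malformed are skipped (the "check the
    data" step). Returns a dict of partition path -> rows written.
    """
    counts = {}
    with open(csv_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        date_index = header.index(date_column)
        for row in reader:
            try:
                day = datetime.strptime(row[date_index], date_format)
            except (IndexError, ValueError):
                continue  # invalid row: no usable date
            partition = os.path.join(f"year={day.year:04d}",
                                     f"month={day.month:02d}")
            target_dir = os.path.join(out_root, partition)
            os.makedirs(target_dir, exist_ok=True)
            # Reopening the file per row keeps the sketch short; a real
            # job would keep one writer open per partition.
            part_file = os.path.join(target_dir, "part-00000.csv")
            with open(part_file, "a", newline="", encoding="utf-8") as out:
                csv.writer(out).writerow(row)
            counts[partition] = counts.get(partition, 0) + 1
    return counts
```

The header row is deliberately not copied into the partition files, since a text-backed Hive table would otherwise read it as data.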
I use Flink 1.6. I know I can use a custom sink and Hive JDBC to write to Hive, or use JDBCAppendTableSink, but that still uses JDBC. The problem is that Hive JDBC does not support the batchExecute method, so I think it will be very slow.
Then I looked for another way: I write a DataSet to HDFS with the writeAsText method, then create a Hive table from HDFS. But there is still a problem: how to append incremental data.
The API of WriteMode is:
Enum FileSystem.WriteMode
NO_OVERWRITE: Creates the target file only if no file exists at that path already.
OVERWRITE: Creates a new target file regardless of any existing files or directories.
For example, in the first batch I write the data for September to Hive; then I get the data for October and want to append it.
But if I use OVERWRITE on the same HDFS file, the September data will no longer exist. If I use NO_OVERWRITE, I must write to a new HDFS file and then create a new Hive table, but we need the data in the same Hive table. And I do not know how to combine two HDFS files into one Hive table.
So how do I write incremental data to Hive using Flink?
As you already wrote, there is no Hive sink. I guess the default pattern is to write (text, Avro, Parquet) files to HDFS and define an external Hive table on that directory. It doesn't matter whether you have a single file or multiple files. If the table is partitioned, though, you most likely have to repair it on a regular basis (msck repair table <db_name>.<table_name>;). This updates the metadata so that the new partitions and their files become available.
For bigger amounts of data, I would recommend partitioning the table and adding the partitions on demand (this blog post might give you a hint: https://resources.zaloni.com/blog/partitioning-in-hive).
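The key point is that the table is a directory, not a file, so appending means adding a new file rather than rewriting an existing one. A rough sketch of that idea in plain Python on a local directory follows; the function names, the partition naming and the `part-*.txt` file pattern are hypothetical, and this is neither Flink nor Hive code.

```python
import glob
import os
import uuid


def append_batch(table_dir, partition, lines):
    """Add one batch to the table as a brand-new file in its partition."""
    part_dir = os.path.join(table_dir, partition)
    os.makedirs(part_dir, exist_ok=True)
    path = os.path.join(part_dir, f"part-{uuid.uuid4().hex}.txt")
    # Mode "x" fails if the file already exists, which is the same
    # guarantee NO_OVERWRITE gives: existing data is never replaced.
    with open(path, "x", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
    return path


def read_table(table_dir):
    """Read the table the way a query engine would: every file under
    every partition directory, concatenated."""
    rows = []
    for path in sorted(glob.glob(os.path.join(table_dir, "*", "part-*.txt"))):
        with open(path, encoding="utf-8") as src:
            rows.extend(line.rstrip("\n") for line in src)
    return rows
```

Writing the September batch and then the October batch produces two files in two partition directories, and both are visible through the same table.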
We are copying data from various sources such as Oracle, Teradata to HDFS using Sqoop. We use incremental update feature to 'import' new data & then 'merge' it with the existing data. Data first gets populated in a temporary directory & then we 'remove' the old & 'rename' the new one.
The problem is that if a user is running a query against the data on HDFS using a tool such as Hive while we swap the directory, the query terminates abnormally.
Is there a better way to handle the updates on HDFS?
(Please note that even though HBase keeps different versions, it doesn't work for us because we want to query by any column. HBase is very slow in cases where you don't search by primary key.)
Hadoop is not designed to work like that. It is good for storing data but not for editing it. I would just add the new data beside the old data, and while adding it (by copying or any other import) you could add the suffix .tmp to the filename. I have not used Hive that much (Pig user here), but in Pig I could write A = LOAD '/some/path/to/hdfs/*.log' and that would load all files except the .tmp ones that are still being imported. With that there are no problems.
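A small sketch of that naming convention in plain Python on a local directory: a file is written under a `.tmp` name and renamed only when it is complete, while readers select files by a `*.log` glob. The function names are hypothetical, and the glob stands in for the Pig LOAD pattern.

```python
import glob
import os


def publish_file(directory, name, lines):
    """Write `name` under a .tmp suffix, then rename it once complete."""
    final_path = os.path.join(directory, name)
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as out:
        out.writelines(line + "\n" for line in lines)
    # Readers never see a half-written file under its final name.
    os.replace(tmp_path, final_path)
    return final_path


def visible_files(directory, pattern="*.log"):
    """Return the files a reader using the glob would pick up."""
    return sorted(glob.glob(os.path.join(directory, pattern)))
```

A file such as `batch2.log.tmp` that is still being imported does not match `*.log`, so a running reader simply ignores it until the rename happens.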
I have a 78 GB file on HDFS.
I need to create an Impala external table over it to perform some grouping and aggregation on the data.
Problem
The file contains headers.
Question
Is there any way to skip the headers while reading the file and query only the rest of the data?
I do have a way to solve the problem, by copying the file to local storage, removing the headers and then copying the updated file back to HDFS, but that is not feasible because the file is too large.
Please suggest if anyone has any idea.
Any suggestions will be appreciated.
Thanks in advance.
UPDATE and DELETE row operations are not available on ordinary Hive/Impala tables backed by HDFS files, so you should simulate DELETE as follows:
1 - Load the data file into a temporary Hive/Impala table
2 - Use INSERT INTO or CREATE TABLE AS on the temp table to create the required table
A straightforward approach would be to run the HDFS data through Pig to filter out the headers and generate a new HDFS dataset formatted so that Impala could read it cleanly.
A more arcane approach would depend on the format of the HDFS data. For example, if both header and data lines are tab-delimited, then you could read everything using a schema with all STRING fields and then filter or partition out the headers before doing aggregations.
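The "read everything as strings, then filter out the headers" idea can be sketched in plain Python as a streaming filter followed by an aggregation. The header field names, the tab delimiter and the sum-by-key aggregation are illustrative assumptions; in Impala the same filter would be a WHERE clause over the all-STRING schema.

```python
def data_rows(lines, header_fields, delimiter="\t"):
    """Yield each line as a list of string fields, skipping any line
    that is exactly the header (headers may repeat inside the data)."""
    for line in lines:
        fields = line.rstrip("\r\n").split(delimiter)
        if fields == header_fields:
            continue
        yield fields


def total_by_key(rows, key_index, value_index):
    """Group rows by one field and sum another, casting from string
    only at aggregation time."""
    totals = {}
    for fields in rows:
        key = fields[key_index]
        totals[key] = totals.get(key, 0) + int(fields[value_index])
    return totals
```

Because `data_rows` is a generator, the file is processed line by line and never has to fit in memory or be copied elsewhere first.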