How to rename a file within a partition in Hive - hadoop

I have date-partitioned data in Hive. However, the file within a certain partition has a name like 000112_0. Is there a way to rename this file?

There is no configuration property to do this, but you can write a custom OutputFormat class to achieve it.
OutputFormat describes the output specification of a MapReduce job.
The MapReduce framework relies on the OutputFormat of the job to:
Validate the output specification of the job, e.g. check that the output directory doesn't already exist.
Provide the RecordWriter implementation to be used to write out the output files of the job. Output files are stored in a FileSystem.
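If you go down that route, here is a minimal sketch of a custom OutputFormat that controls the names of the files it writes, assuming the new-API TextOutputFormat is usable for your job; the part-renamed prefix is purely illustrative.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RenamingTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public Path getDefaultWorkFile(TaskAttemptContext context, String extension)
            throws IOException {
        FileOutputCommitter committer =
                (FileOutputCommitter) getOutputCommitter(context);
        // Build a custom file name instead of the default part-r-NNNNN pattern.
        String name = "part-renamed-"
                + context.getTaskAttemptID().getTaskID().getId() + extension;
        return new Path(committer.getWorkPath(), name);
    }
}

For files Hive has already written (like your 000112_0), the rename in the next answer is the simpler option.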

You can. Run:
hadoop fs -mv /path_to_file/old_name /path_to_file/new_name

Related

How to write incremental data to Hive using Flink

I use Flink 1.6. I know I can use a custom sink with Hive JDBC to write to Hive, or use JDBCAppendTableSink, but that still goes through JDBC. The problem is that Hive JDBC does not support the batchExecute method, so I think it will be very slow.
Then I looked for another way: I write a DataSet to HDFS with the writeAsText method, then create a Hive table from HDFS. But there is still a problem: how to append incremental data.
The API of FileSystem.WriteMode is:
NO_OVERWRITE: creates the target file only if no file exists at that path already.
OVERWRITE: creates a new target file regardless of any existing files or directories.
For example, in the first batch I write September's data to Hive; then I get October's data and want to append it.
But if I use OVERWRITE on the same HDFS file, September's data will no longer exist; if I use NO_OVERWRITE, I must write it to a new HDFS file and then a new Hive table, yet we need them in the same Hive table. And I do not know how to combine two HDFS files into one Hive table.
So how do I write incremental data to Hive using Flink?
As you already wrote, there is no Hive sink. I guess the default pattern is to write (text, Avro, Parquet) files to HDFS and define an external Hive table on that directory. There it doesn't matter whether you have a single file or multiple files. But you most likely have to repair this table on a regular basis (msck repair table <db_name>.<table_name>;). This will update the metadata and the new files will become available.
For bigger amounts of data I would recommend partitioning the table and adding the partitions on demand (this blog post might give you a hint: https://resources.zaloni.com/blog/partitioning-in-hive). A sketch of this pattern follows below.
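A minimal sketch of that pattern, assuming the Flink 1.6 DataSet API: each batch goes into its own month=... subdirectory under one base path, and a partitioned external Hive table is defined on that base path. The base path, the partition value and the hard-coded source data are illustrative only.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.FileSystem.WriteMode;

public class MonthlyBatchToHdfs {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // In a real job this would come from your source; it is hard-coded here.
        DataSet<String> octoberRows = env.fromElements("row1", "row2");

        // Every batch lands in its own partition directory, so nothing is overwritten.
        String month = "2018-10";
        octoberRows.writeAsText("hdfs:///data/tweets/month=" + month,
                WriteMode.NO_OVERWRITE);

        env.execute("write October batch");
    }
}

After each batch you then make the new directory visible to Hive, either with msck repair table or with alter table ... add partition.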

To replace a name with another name in a file

I am very new to Hadoop, and I have a requirement to scrub a file containing account number, name and address details; I need to replace these name and address details with other names and addresses that exist in another file.
I am fine with either MapReduce or Hive.
I need help on this.
Thank you.
You can write a simple mapper-only job (with the number of reducers set to zero), update the information, and store the output in another location. Verify the output of your job, and if it is what you expected, remove the old files. Remember, HDFS does not support in-place editing or overwriting of files.
Hadoop - MapReduce Tutorial.
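A minimal sketch of such a mapper-only job; the scrubbing itself is only a placeholder, and in practice the file with the replacement names and addresses could be loaded in setup(), e.g. via the distributed cache.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ScrubJob {

    public static class ScrubMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Placeholder: replace the name and address fields of each record here.
            String scrubbed = value.toString();
            context.write(NullWritable.get(), new Text(scrubbed));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "scrub accounts");
        job.setJarByClass(ScrubJob.class);
        job.setMapperClass(ScrubMapper.class);
        job.setNumReduceTasks(0); // mapper-only: no reducers
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}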
You can also use Hive to accomplish this task:
1. Write a Hive UDF based on your scrubbing logic (see the sketch below).
2. Use the above UDF for each column you want to scrub in the Hive table and store the data in a new Hive table.
3. Remove the old Hive table.
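A minimal sketch of step 1, using the simple org.apache.hadoop.hive.ql.exec.UDF API; the hard-coded replacement value is a placeholder for the lookup against your second file.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class ScrubNameUDF extends UDF {
    public Text evaluate(Text name) {
        if (name == null) {
            return null;
        }
        // Placeholder: return the replacement value from your mapping file instead.
        return new Text("REDACTED");
    }
}

After packaging it into a jar you register it with add jar and create temporary function, and apply it per column in an insert ... select into the new table (step 2).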

How do I get the generated filename when calling the Spark saveAsTextFile method

I'm new to Spark, Hadoop and everything that comes with them. My overall goal is to build a real-time application that gets tweets and stores them on HDFS in order to build a report based on HBase.
I'd like to get the generated filename when calling the RDD saveAsTextFile method, in order to import it into Hive.
Feel free to ask for further information, and thanks in advance.
saveAsTextFile will create a directory of part files. So if you give it the path "hdfs://user/NAME/saveLocation", a folder called saveLocation will be created, filled with part files (part-00000, part-00001, ...). You should be able to load this into HBase simply by passing the directory name, since a directory of part files is the standard output layout in Hadoop.
I do recommend you look into saving as Parquet though; it is much more useful than plain text files.
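If you need the concrete file names, one option is to list the save directory afterwards with the Hadoop FileSystem API. A minimal sketch, assuming the save path is known and the default part- naming; the path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListSavedPartFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // saveAsTextFile("hdfs://.../saveLocation") produced this directory.
        for (FileStatus status : fs.listStatus(new Path("/user/NAME/saveLocation"))) {
            if (status.getPath().getName().startsWith("part-")) {
                System.out.println(status.getPath());
            }
        }
    }
}

Those paths can then be used in a Hive load data inpath statement.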
From what I understand, you saved your tweets to HDFS and now want the file names of those saved files. Correct me if I'm wrong.
val filenames = sc.wholeTextFiles("your hdfs location where you saved your tweets").map(_._1)
This gives you an RDD of the filenames, on which you can do your operations. I'm a newbie to Hadoop too, but anyway... hope that helps.

Use elephant-bird with hive to read protobuf data

I have a problem similar to this one.
The following is what I used:
CDH4.4 (hive 0.10)
protobuf-java-2.4.1.jar
elephant-bird-hive-4.6-SNAPSHOT.jar
elephant-bird-core-4.6-SNAPSHOT.jar
elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar
The jar file which includes the protoc-compiled .class files.
I followed the Protocol Buffers Java tutorial to create my data file "testbook".
Then I used hdfs dfs -mkdir /protobuf_data to create the HDFS folder, and hdfs dfs -put testbook /protobuf_data to put "testbook" into HDFS.
Then I followed the elephant-bird web page to create the table; the syntax is like this:
create table addressbook
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.example.tutorial.AddressBookProtos$AddressBook")
stored as
inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/protobuf_data/';
All worked.
But when I submit the query select * from addressbook; no results come out.
And I couldn't find any error logs to debug with.
Could someone help me?
Many thanks.
The problem has been solved.
At first I put the protobuf binary data directly into HDFS, and no results showed,
because it doesn't work that way.
After asking some senior colleagues, I learned that protobuf binary data should be written into some kind of container, some file format, like a Hadoop SequenceFile.
The elephant-bird page mentions this too, but at first I couldn't understand it completely.
After writing the protobuf binary data into a SequenceFile, I can read the protobuf data with Hive.
And because I use the SequenceFile format, the create table syntax uses:
inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
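A minimal sketch of that container step, using a plain Hadoop SequenceFile whose values are BytesWritable holding the serialized AddressBook messages. Whether this exact layout is what your elephant-bird serde and input format expect should be checked against the elephant-bird documentation, and the generated builder method names depend on your .proto; the path and the sample person are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

import com.example.tutorial.AddressBookProtos.AddressBook;
import com.example.tutorial.AddressBookProtos.Person;

public class WriteAddressBookSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Builder method names come from the tutorial's addressbook.proto.
        AddressBook book = AddressBook.newBuilder()
                .addPerson(Person.newBuilder().setId(1).setName("Alice").build())
                .build();

        // Write the serialized message as one record of a SequenceFile.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/protobuf_data/testbook.seq")),
                SequenceFile.Writer.keyClass(NullWritable.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            writer.append(NullWritable.get(), new BytesWritable(book.toByteArray()));
        }
    }
}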
Hope it can help others who are new to Hadoop, Hive and elephant-bird too.

HDFS file compression internally

I am looking for default compression in HDFS. I saw this, but I don't want my files to have gzip-like extensions (in fact, they should be accessible as if they weren't compressed). What I am looking for is exactly like the "Compress contents to save disk space" option on Windows. That option compresses the files internally, but they can be accessed just like usual files. Any ideas will be helpful.
Thanks
This doesn't exist in standard HDFS implementations; you have to manage your own compression. However, MapR, a proprietary Hadoop distribution, does do this, if solving this problem is important enough for you.
After using Hadoop for a little while this doesn't really bother me anymore. Pig, MapReduce and such handle the compression automatically enough for me. I know that's not a real answer, but I couldn't tell from your question whether you are simply annoyed or have a real problem this is causing. Getting used to adding | gunzip to everything didn't take long. For example:
hadoop fs -cat /my/file.gz | gunzip
cat file.txt | gzip | hadoop fs -put - /my/file.txt.gz
When you're using compressed files you need to think about making them splittable, i.e. whether Hadoop can split the file when running a MapReduce job (if the file is not splittable it will only be read by a single map task).
The usual way around this is to use a container format, e.g. a sequence file, ORC file etc., where you can enable compression. If you are using simple text files (CSV etc.) there's an LZO project by Twitter, but I haven't used it personally.
The standard way to store files with compression in HDFS is through a compression argument supplied while writing the file into HDFS. This is available in the MapReduce libraries, Sqoop, Flume, Hive, HBase and so on. I am quoting some examples here from Hadoop. You don't need to worry about compressing the file locally for efficiency; it is best to let the HDFS-side tooling perform this work, and this type of compression integrates smoothly with MapReduce processing.
Job written through the mapper library
You can enable compression while creating the writer in your mapper program. Here is the definition; you write your own mapper and reducer to write the file into HDFS, with your codec passed as an argument to the writer method:
createWriter(Configuration conf, FSDataOutputStream out, Class keyClass, Class valClass, org.apache.hadoop.io.SequenceFile.CompressionType compressionType, CompressionCodec codec)
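A minimal sketch of creating such a writer, assuming the newer option-based createWriter API rather than the older signature quoted above, and assuming the Snappy native libraries are available; the path and key/value classes are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class CompressedSequenceFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/tmp/data/logs/part-00000.seq")),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                // Block compression with the Snappy codec.
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new SnappyCodec()))) {
            writer.append(new LongWritable(1L), new Text("a log line"));
        }
    }
}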
Sqoop Import
The option below enables the default compression for the file import into HDFS:
sqoop import --connect jdbc:mysql://yourconnection/rawdata --table loglines --target-dir /tmp/data/logs/ --compress
With Sqoop you can also specify a specific codec with this option:
sqoop import --connect jdbc:mysql://yourconnection/rawdata --table loglines --target-dir /tmp/data/logs/ --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Hive import
In the example below, you can use your desired compression option while writing the file into Hive. Again, this is a property you set before loading from your local file.
SET hive.exec.compress.output=true;
SET parquet.compression=SNAPPY; -- this is the default actually
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS PARQUET;
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log' INTO TABLE raw;
I have not mentioned all of the methods of compressing data while importing into HDFS.
The HDFS CLI (e.g. hdfs dfs -copyFromLocal) doesn't provide any direct way to compress. This is my understanding of working with the Hadoop CLI.
