Get filename of Record in Hive - hadoop

Is it possible to get the filename of a record in Hive? That would be incredibly helpful for debugging.
In my particular case, I've an incorrect values ​​in a table that is mapped to a folder with > 100 large files. To use grep is very inefficient

HIVE supports virtual columns, for example INPUT__FILE__NAME. It gives the input file's name for a mapper task.
Have a look at the documentation here. It provides some example on how to do this.
Unfortunately, I'm unable to test the same now. Let me know if this is working or not.

Related

How to include column name from oracle source to flat file (destination) in datastage?

I'm a total beginner and working my way to become a good ETL developer and i uses IBM Infosphere Datastage. I'm able to transfer/import data from databases(Oracle) to sequential file(csv) but i wanted to get the columns name?
is there a way to do it ? i don't have anyone that taught me, i just do it by myself
So any idea would be very helpful.
Thanks!
In the sequential file stage, simply set the option "First Line is Column Names = True", compile and run the job then you will see the column names in the csv file.

How does the CONCATENATE in ALTER TABLE command in HIVE works

I am trying to understand how exactly the ALTER TABLE CONCATENATE in HIVE Works.
I saw this link How does Hive 'alter table <table name> concatenate' work? but all I got from this links is that for ORC Files, the merge happens at a stripe level.
I am looking for a detailed explanation of how CONCATENATE works. As an e.g I initially had 500 small ORC Files in the HDFS. I ran the Hive ALTER TABLE CONCATENATE and the files merged to 27 bigger files. Subsequent runs of CONCATENATE reduced the number of files to 16 and finally I ended up in two large files.( used version Hive 0.12 ) So I wanted to understand
How exactly CONCATENATE works? Does it looks at the existing number of files , as well as the size ? How will it determine the no: of output ORC files after concatenation?
Is there any known issues with using the Concatenate ? We are planning to run the concatenate one a day in the maintenance window
Is Using CTAS an alternative to concatenate and which is better? Note that my requirement is to reduce the no of ORC files (ingested through Nifi) without compromising performance of Read
Any help is appreciated and thanks in advance
Concatenated file size can be controlled with following two values:
set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;
These values should be set based on your HDFS/MapR-FS block size.
As commented by #leftjoin it is indeed the case that you can get different output files for the same underlying data.
This is discussed more in the linked HCC thread but the key point is:
Concatenation depends on which files are chosen first.
Note that having files of different sizes, should not be a problem in normal situations.
If you want to streamline your process, then depending on how big your data is, you may also want to batch it a bit before writing to HDFS. For instance, by setting the batch size in NiFi.

Sequence File of Objects into Hive

We started with a bunch of data stored in NetCDF files. From there, some Java code was written to create sequence files from the NetCDF files. We don't know much about the original intentions of the code, but we have been able to learn a little bit about the sequence files themselves. Ultimately, we are trying to create tables within Hive using these sequence files, but seem incapable of doing so at the moment.
We know that the keys and values within the sequence files are stored as objects that implements WritableComparable. We are also capable of creating Java code to iterate through all of the data in the sequence files.
So, what would be necessary to actually get Hive to read the data within the objects of these sequence files properly?
Thanks in advanced!
UPDATE: The reason it is so difficult to describe where I am having trouble exactly is because I am not necessarily getting any errors. Hive is simply just reading the sequence files incorrectly. When running the Hadoop -text command on my sequence file I get a list of objects as such:
NetCDFCompositeKey#263c7e3f , NetCDFRecordWritable#4d846db5
The data is within those objects themselves. So, currently from the help of #Tariq I believe what I have to do in order to actually read those objects is to create a custom InputFormat to read the keys and a custom SerDe to serialize and deserialize the objects?
I'm sorry, i'm not able to understand from your question where exactly you are facing the problem. If you wish to use SequenceFiles through Hive you just have to add STORED AS SEQUENCEFILE clause while issuing CREATE TABLE(most probably you already know this, nothing new). When you work on SequenceFiles Hive treats each key/value pair of the SequenceFiles similar to rows in normal files. Important thing here is that keys will be ignored. Apart from that nothing very special.
Having said that, if you wish to read both keys and values, you might have to write a custom InputFormat that can read both keys and values. See this project for example. It allows us to access data stored in a SequenceFile's key.
Also, if your keys and values are custom classes, you will require to write a SerDe as well to serialize and deserialize your data.
HTH
P.S. : I don't know if this is exactly what you were looking for. Do let me know if it is not and add some more detail to your question. I'll try addressing that.

Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata

Our environment is heavy into storing data in hive. I find myself currently working on something that it outside the scope though. I have a mapreduce written, but it requires a lot of direct user inputs for information that could easily be scraped from Hive. That said, when I query hive for extended table data, all of the extended information is thrown out in 1 or 2 columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information, or better yet, get it directly in a more direct manor?
Alternatively, if I could get pointed to documentation on manually using the CombinedHiveInputFormat, that would simplify my code a lot more. But it seems like that InputFormat is solely used inside of Hive, using it's custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.

Generate multiple outputs with Hadoop Pig

I've got this file containing a list of data in Hadoop. I've build a simple Pig script which analyze the file by the id number, and so on...
The last step I'm looking for is this: I'd like to to create (store) a file for each unique id number. So this should depend on a group step...however, I haven't understood if this is possible (maybe there is a custom store module?).
Any idea?
Thanks
Daniele
While keeping in mind what is said by frail, MultiStorage, in PiggyBank, seems to be what you are looking for.
for getting an output(file or anything) you need to assign data to a variable, thats how it works with STORE. If id's are limited and finite you can FILTER them one by one and then STORE them. (I always do that for action types which is about 20-25).
But if you need to get each unique id file badly then make 2 files. 1 with whole data in it grouped by id, 1 with just unique ids. Then try generating 1(or more if you have too many) pig scripts that FILTER BY that id. But it's a bad solution. Assuming you would group 10 ids in a pig script you would have (unique id count/10) pig scripts to run.
Beware that Hdfs ain't good at handling too many small files.
Edit:
A better solution would be to GROUP and SORT by unique id to a big file. Then since its sorted you can easily divide the contents with a 3rd party script.

Resources