Hadoop Pig ISO Date to Unix Timestamp

I have a list of items in Pig consisting of ISO 8601 (YYYY-MM-DD) formatted date strings:
(2011-12-01)
(2011-12-01)
(2011-12-02)
Is there any way to transform these items into UNIX timestamps apart from implementing my own functions in Java?

You need a UDF to do that. The good news is that it has already been done.
Pig also comes with "piggybank", a collection of community-contributed UDFs that includes date conversion.
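For example, piggybank's ISOToUnix UDF converts an ISO 8601 string into a Unix timestamp in milliseconds. A minimal sketch, where the jar path and input file name are assumptions (on older Pig versions the joda-time jar may also need to be registered):

REGISTER /path/to/piggybank.jar;
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();

-- each input record is one ISO 8601 date string, e.g. (2011-12-01)
dates = LOAD 'dates.txt' AS (iso_date:chararray);

-- ISOToUnix returns milliseconds since the epoch; divide by 1000 for seconds
ts = FOREACH dates GENERATE ISOToUnix(iso_date) / 1000 AS unix_ts:long;

DUMP ts;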

Related

How to write a UDAF for Hive using Java that accepts multiple columns as arguments?

I want to calculate currency_rate based on a few inputs such as date, var_currecy_code, and fxd_crncy_code.
We have all the data in our Hive table; now we need to calculate currency_rate based on the maximum date and the other inputs mentioned above, using a Hive UDAF.
A Hive UDF can accept multiple columns as parameters. Within the function, you check the number of arguments passed in and extract them in the order your logic needs, as in the sketch below.
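A minimal GenericUDF sketch in Java; the class name, the three-argument signature, and the DOUBLE return type are assumptions, and the actual rate lookup is left as application-specific logic:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class CurrencyRateUDF extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Check how many columns were passed in and validate their types here.
        if (arguments.length != 3) {
            throw new UDFArgumentException(
                    "currency_rate(date, var_currecy_code, fxd_crncy_code) takes 3 arguments");
        }
        return PrimitiveObjectInspectorFactory.javaDoubleObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Extract the columns in the order your logic needs.
        Object date = arguments[0].get();
        Object varCurrencyCode = arguments[1].get();
        Object fxdCurrencyCode = arguments[2].get();
        // ... look up / compute the rate here (application-specific) ...
        return Double.valueOf(1.0); // placeholder result
    }

    @Override
    public String getDisplayString(String[] children) {
        return "currency_rate(" + String.join(", ", children) + ")";
    }
}

Package it in a jar, then in Hive: ADD JAR, CREATE TEMPORARY FUNCTION currency_rate AS 'CurrencyRateUDF', and call currency_rate(date, var_currecy_code, fxd_crncy_code) in your query.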

logstash-elasticsearch: sort data by timestamp

I centralize log files into one log file using logstash, and each event carries a timestamp (the original one).
Now, my last challenge is to get this data sorted by timestamp (in real time, if possible).
My timestamp format is: yyyy-MM-dd HH:mm:ss
I can make any change to the format or file format to make this work, as long as the data stays on our servers.
What's the best way to sort my data? Any ideas?
Thanks in advance!

Hadoop Input Formats - Usage

I want to know about the different file formats in Hadoop. By default Hadoop uses the text input format. What are the advantages/disadvantages of using the text input format?
What are the advantages/disadvantages of Avro over the text input format?
Also, please help me understand the use cases for the different file formats (Avro, Sequence, TextInput, RCFile).
I believe Text as the default has no advantage other than that its contents are human readable and friendly; you can easily view them by issuing hadoop fs -cat.
The disadvantages of the Text format are:
It takes more space on disk, which impacts production job efficiency.
Writing/parsing text records takes more time.
There is no way to preserve data types when the text is composed of multiple columns.
The Sequence, Avro, and RCFile formats have significant advantages over the Text format.
Sequence - Key/value objects are stored directly in binary form through Hadoop's native serialization process, by implementing the Writable interface. The data types of the columns are preserved, and parsing the records back with the relevant types is easy (see the sketch after this list). Being binary, it obviously takes less space than Text.
Avro - A very compact binary storage format for Hadoop key/value pairs that reads/writes records through Avro serialization/deserialization. It is very similar to the Sequence file format but additionally provides language interoperability and cell versioning.
Choose Avro over Sequence only if you need cell versioning or if the stored data will be used by other applications written in languages other than Java. Avro files can be processed from many languages such as C, Ruby, Python, PHP, and Java, whereas Sequence files are Java-specific.
RCFile - The Record Columnar File format is column oriented; it is a Hive-specific storage format designed to give Hive faster data loading and reduced storage space.
Apart from these, you may also consider the ORC and Parquet file formats.
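To make the Sequence point concrete, here is a minimal Java sketch that writes typed key/value pairs through Hadoop's Writable serialization (the output path and the records themselves are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq"); // assumed output location

        // Keys and values are Writable objects, so the column types
        // survive the round trip instead of being flattened to text.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        }
    }
}

Reading the file back with SequenceFile.Reader hands you IntWritable/Text objects directly, with no string parsing.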

Hive file formats advantages and disadvantages

I am starting to work with Hive.
I want to know which queries suit each of these table formats:
rcfile, orcfile, parquet, delimited text
When you have tables with a very large number of columns and you tend to use specific columns frequently, the RCFile format is a good choice. Rather than reading the entire row of data, you retrieve just the required columns, saving time. The data is divided into groups of rows, which are in turn divided into groups of columns.
Delimited text is the general-purpose file format.
For the ORC file format, the Hive documentation has a detailed description here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
The Parquet file format stores data in columnar form.
e.g.:
Col1 Col2
A 1
B 2
C 3
Row-oriented storage lays this out as A1B2C3; with Parquet, the data is stored as ABC123.
For the Parquet file format, have a read of https://blog.twitter.com/2013/dremel-made-simple-with-parquet
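As a quick illustration of how these choices look in practice, a Hive table's on-disk format is selected with the STORED AS clause. A minimal sketch with hypothetical table and column names (STORED AS ORC requires Hive 0.11+, STORED AS PARQUET requires Hive 0.13+):

-- one table per storage format discussed above
CREATE TABLE sales_text (id INT, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
CREATE TABLE sales_rc (id INT, amount DOUBLE) STORED AS RCFILE;
CREATE TABLE sales_orc (id INT, amount DOUBLE) STORED AS ORC;
CREATE TABLE sales_parquet (id INT, amount DOUBLE) STORED AS PARQUET;

-- populate a columnar copy; a query touching only one column
-- then reads just that column's data from disk
INSERT OVERWRITE TABLE sales_orc SELECT id, amount FROM sales_text;
SELECT SUM(amount) FROM sales_orc;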
I see there are already a couple of answers, but since your question didn't ask about any particular file format, each answer addressed one format or another.
There are a bunch of file formats that you can use in Hive; notable mentions are Avro, Parquet, RCFile, and ORC. There are some good documents available online comparing the performance and space utilization of these file formats. Here are some useful links that will get you going:
This Blog Post
This link from MapR [They don't discuss Parquet though]
This link from Inquidia
I hope this answers your query. Thanks!

hive hbase integration timestamp

I would like to store a table in HBase using Hive (Hive-HBase integration).
My table contains a field of type TIMESTAMP (like DATE).
I've done some research and discovered that TIMESTAMP is not supported by HBase, so what should I do? Here is the error I get:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating dat
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:80)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:529)
... 9 more
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:185)
at org.apache.hadoop.hive.serde2.lazy.LazyTimestamp.init(LazyTimestamp.java:74)
at org.apache.hadoop.hive.serde2.lazy.LazyStruct.uncheckedGetField(LazyStruct.java:219)
at org.apache.hadoop.hive.serde2.lazy.LazyStruct.getField(LazyStruct.java:192)
at org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:188)
at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.evaluate(ExprNodeColumnEvaluator.java:98)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:76)
The easiest thing to do would be to convert the TIMESTAMP into a STRING, INT, or FLOAT. This has the unfortunate side effect of giving up Hive's built-in TIMESTAMP support. As a result, you will lose:
Read-time checks that make sure your column contains a valid TIMESTAMP
The ability to transparently use TIMESTAMPs of different formats
The use of Hive UDFs that operate on TIMESTAMPs.
The first two losses are mitigated if you choose a single format for your timestamps and stick to it. The last is not a huge loss, because only two Hive date functions actually operate on TIMESTAMPs; most of them operate on STRINGs. If you absolutely need from_utc_timestamp and to_utc_timestamp, you can write your own UDFs.
If you go with STRING and only need the date, use a yyyy-mm-dd format. If you need the time as well, go with yyyy-mm-dd hh:mm:ss, or yyyy-mm-dd hh:mm:ss[.fffffffff] if you need sub-second timestamps. This format is also consistent with how Hive expects TIMESTAMPs, and it is the form required by most Hive date functions.
If you go with INT, you again have a couple of options. If only the date is important, YYYYMMDD fits the "basic" format of ISO 8601 (a form I've personally used and found convenient when I didn't need to perform any date operations on the column). If the time is also important, go with YYYYMMDDhhmmss, an acceptable variant of the ISO 8601 basic form for date and time. If you need fractional seconds, use a FLOAT with the form YYYYMMDDhhmmss.fffffffff. Note that neither of these forms is consistent with how Hive expects integer or floating-point TIMESTAMPs.
If the concept of calendar dates and time of day isn't important at all, then using an INT as a Unix timestamp is probably the easiest, or a FLOAT if you also need fractional seconds. This form is consistent with how Hive expects TIMESTAMPs.
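A few Hive expressions for these conversions; a minimal sketch, assuming a source table named events with a TIMESTAMP column named ts:

-- STRING in Hive's native layout, yyyy-mm-dd hh:mm:ss[.fffffffff]
SELECT CAST(ts AS STRING) FROM events;

-- BIGINT as a Unix timestamp (seconds since the epoch)
SELECT UNIX_TIMESTAMP(CAST(ts AS STRING)) FROM events;

-- INT in the ISO 8601 "basic" date form, YYYYMMDD
SELECT CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(ts AS STRING)), 'yyyyMMdd') AS INT) FROM events;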
