I am trying to process PST files from Apache Pig. Do I need to write a user-defined function to read PST? How can I process/extract PST files the Pig way? Is there any UDF or Java library available that I can use in my Pig script? Any help would be greatly appreciated.
Thanks,
Vamshikrishna
I'm not aware of any built-in support for .pst files, so probably yes, you need to write a UDF, and more precisely a load UDF (a LoadFunc).
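For orientation, here is a minimal sketch of what such a load UDF skeleton looks like in Java. It is not a working PST parser: the class name PstLoader is hypothetical, and TextInputFormat is only a placeholder where a real loader would need its own InputFormat/RecordReader that understands the binary PST layout.

import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical skeleton of a Pig load UDF; not a working PST parser.
public class PstLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Tell Hadoop where the input files live.
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // A real PST loader would return a custom InputFormat here;
        // TextInputFormat is only a stand-in so the skeleton compiles.
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // no more records
            }
            // Convert whatever the RecordReader produced into a Pig tuple.
            return tupleFactory.newTuple(reader.getCurrentValue().toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

In a Pig script it would then be used like A = LOAD '/data/mail' USING PstLoader(); (the path is just an example).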
Related
I am new to Hadoop and have some confusion. Can you please help?
Q. How are Hive, Pig and Impala used in practical projects? Are they used only from the command line, or also from within Java, Scala, etc.?
One can use Hive and Pig from the command line, or run scripts written in their respective languages.
Of course it is possible to build and call these scripts in any way you like, so you could have a Java program build a Pig command on the fly and execute it.
The Hive (and Pig) languages are typically used to talk to a Hive database. Besides this, it is also possible to talk to the Hive database via a link (JDBC/ODBC). This can be done directly from anywhere, so you could let a Java program make a JDBC connection to talk to your Hive tables.
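To illustrate that last point, here is a minimal sketch of a Java program talking to Hive over JDBC (HiveServer2); the host, port, database, table and user are assumptions, not values from this question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (hive-jdbc must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical host/port/database; adjust to your cluster.
        String url = "jdbc:hive2://hive-server:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM my_table LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
            }
        }
    }
}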
Within the context of this answer, I believe everything I said about the Hive language also applies to Impala.
Can we use a Unix shell script instead of Java or Python for user-defined functions in Apache Pig and Hive?
If it is possible, how do we reference it in a Hive query or Pig script?
No, you can't use a Unix shell script as a Pig UDF. Pig UDFs are currently supported in only six languages: Java, Jython, Python, JavaScript, Ruby and Groovy.
Please refer to this link for more details:
http://pig.apache.org/docs/r0.14.0/udf.html
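For completeness, a Java UDF is just a class extending EvalFunc. A minimal sketch (the class name and the upper-casing logic are only illustrative):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical example UDF: upper-cases its first argument.
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}

It would then be registered in the Pig script with REGISTER myudfs.jar; and called like any built-in function, e.g. B = FOREACH A GENERATE ToUpper(name);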
I feel comfortable loading HCatalog using Pig and was wondering if it's possible to use Spark instead of Pig. Unfortunately, I'm quite new to Spark...
Can you provide any materials on how to start? Are there any Spark libraries to use?
Any examples? I've done all the exercises on http://spark.apache.org/ but they focus on RDDs and don't go any further...
I will be grateful for any help...
Regards
Pawel
You can use Spark SQL to read from Hive tables instead of HCatalog.
https://spark.apache.org/sql/
You can apply the same transformations as in Pig (filter, join, group by, ...) using Spark's Java/Scala/Python APIs.
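A minimal sketch in Java, assuming a Spark build with Hive support and a Hive table named mydb.events (both assumptions); older Spark versions used HiveContext instead of SparkSession for the same purpose:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkHiveExample {
    public static void main(String[] args) {
        // enableHiveSupport() lets Spark SQL read table metadata from the Hive metastore.
        SparkSession spark = SparkSession.builder()
                .appName("hive-from-spark")
                .enableHiveSupport()
                .getOrCreate();

        // Read a Hive table (hypothetical name) and apply Pig-like transformations.
        Dataset<Row> events = spark.table("mydb.events");
        Dataset<Row> counts = events
                .filter("status = 'OK'")   // roughly a FILTER in Pig
                .groupBy("country")        // roughly a GROUP in Pig
                .count();

        counts.show();
        spark.stop();
    }
}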
You can reference the following link for using the HCatalog InputFormat wrapper with Spark; it was written prior to Spark SQL.
https://gist.github.com/granturing/7201912
Our systems have both loaded and we can use either. Spark takes on the traits of the language you are using (Scala, Python, ...). For example, using Spark with Python you can utilize many Python libraries from within Spark.
Using Pig or Hadoop Streaming, has anyone loaded and uncompressed a zipped file? The original CSV file was compressed using PKZIP.
Not sure if this helps because it's mainly focused on using MapReduce in Java, but there is a ZipFileInputFormat available for Hadoop. Its use via the Java API is described here:
http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
The main part of this is the ZipFileRecordReader, which uses Java's ZipInputStream to process each ZipEntry. The Hadoop reader is probably not going to work for you out of the box, because it passes the file path of each ZipEntry as the key and the ZipEntry contents as the value.
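If you end up rolling your own, the core of such a reader is plain java.util.zip iteration over a stream opened from HDFS. A rough standalone sketch (the HDFS path is an assumption, and a real RecordReader would emit the entries instead of printing them):

import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ZipFromHdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical HDFS path to the pkzip archive.
        try (ZipInputStream zis =
                     new ZipInputStream(fs.open(new Path("/data/input.zip")))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                // In a RecordReader you would emit entry.getName() as the key
                // and the entry bytes as the value; here we just print sizes.
                java.io.ByteArrayOutputStream bos = new java.io.ByteArrayOutputStream();
                IOUtils.copyBytes(zis, bos, 4096, false);
                System.out.println(entry.getName() + " -> " + bos.size() + " bytes");
            }
        }
    }
}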
I am using MAPI tools (a Microsoft library, in .NET) and then the Apache Tika libraries to process and extract PST files from the Exchange server, which is not scalable.
How can I process/extract PST files the MR way? Is there any tool or Java library available that I can use in my MR jobs? Any help would be greatly appreciated.
The JPST library internally uses: PstFile pstFile = new PstFile(java.io.File)
And the problem is that the Hadoop APIs don't give us anything close to java.io.File.
The following option is always there, but it is not efficient:

// Pull the PST out of HDFS into a local temp file, because the library
// only accepts java.io.File, then open it locally.
File tempFile = File.createTempFile("myfile", ".tmp");
fs.moveToLocalFile(new Path(<HDFS pst path>), new Path(tempFile.getAbsolutePath()));
PstFile pstFile = new PstFile(tempFile);
Take a look at Behemoth (http://digitalpebble.blogspot.com/2011/05/processing-enron-dataset-using-behemoth.html). It combines Tika and Hadoop.
I've also written my own Hadoop + Tika jobs. The pattern is as follows (a rough sketch of the mapper is given after the steps):
Wrap all the PST files into sequence or Avro files.
Write a map-only job that reads the PST files from the Avro files and writes them to the local disk.
Run Tika across the files.
Write the output of Tika back into a sequence file.
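A minimal sketch of that map-only step, assuming the PST files were packed into a SequenceFile of <file name, file bytes> pairs; that key/value layout and the class name are assumptions, while the Tika facade call is the real API:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical map-only job body: each record is (pst file name, pst bytes).
public class PstExtractMapper extends Mapper<Text, BytesWritable, Text, Text> {

    private final Tika tika = new Tika();

    @Override
    protected void map(Text fileName, BytesWritable content, Context context)
            throws IOException, InterruptedException {
        // The PST/Tika parsers need a real local file, so spill the bytes
        // to the task's local disk first.
        File localPst = File.createTempFile("pst-", ".pst");
        try {
            Files.write(localPst.toPath(), content.copyBytes());
            // Run Tika over the local copy and emit the extracted text,
            // which the job then writes back into a sequence file.
            String extractedText = tika.parseToString(localPst);
            context.write(fileName, new Text(extractedText));
        } catch (TikaException e) {
            throw new IOException(e);
        } finally {
            localPst.delete();
        }
    }
}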
Hope that helps.
It's not possible to process a PST file directly in the mapper. After long analysis and debugging it was found that the API is not exposed properly, and those APIs need the local file system to store the extracted PST contents; they can't write directly to HDFS. That's the bottleneck. And all of those APIs (the libraries that extract and process PST) are not free.
What we can do is extract outside HDFS and then process the extracted content in MR jobs.