Using Spark fileStream with Avro Data Input - hadoop

I'm trying to create a Spark Streaming application using fileStream(). The documentation gives this signature:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
I need to pass KeyClass, ValueClass, and InputFormatClass. My main question is: what can I use for these parameters for Avro-formatted data?
Note that my Avro data already has the schema embedded in it.
I found a related question here; however, their input is in Parquet format.
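For reference, a sketch of one combination that should satisfy those type parameters, assuming avro-mapred's AvroKeyInputFormat is on the classpath (this is my own assumption, not something confirmed in the thread; the input directory is hypothetical):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AvroFileStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("avro-filestream"), Seconds(30))

    // AvroKeyInputFormat deserializes each record with the writer schema embedded in the
    // container file, so no external .avsc is needed just to read the data.
    val stream = ssc.fileStream[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]]("/data/avro")

    // GenericRecord.toString renders the record as JSON-style text, handy for a smoke test.
    stream.map { case (key, _) => key.datum().toString }.print()

    ssc.start()
    ssc.awaitTermination()
  }
}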

Related

How can I process Avro Container data with different versions of schema?

I have months' worth of data from a single domain stored in HDFS in Avro container files. Each file has the schema for all the data in that file, of course. How do I process all the data using Hive or Pig? It seems both Hive and Pig need an avsc file, or some form of table-structure definition, up front. That is, even if I use Avro tools to extract the avsc from each file, I will have to load each dataset using a different avsc file, and I cannot process all of them with one job or one DDL + query.
Isn't it possible for Hive and Pig to pull the avsc at runtime based on the Avro container file being processed? Is it already implemented and I'm just not finding it, or is it too difficult to implement?

Big data live data streaming using Flume

I am trying to analyze Twitter data using Flume.
I got the files from Twitter using Flume in BigInsights,
but the data I received is in a compressed Avro format, which is not readable.
Can anyone tell me a way to convert that file to JSON (readable)
so that I can do some analysis on it?
Or is there any way to make the data I receive arrive already in JSON (readable) format?
Thanks in advance.
The Avro format is not designed to be human-readable; it's designed to be consumed by programs. But you have a few options to view this data, or better yet, to analyze it.
Create a Hive table: This option will let you analyze the data using SQL queries, Spark SQL, Spark notebooks, and visualization tools like Tableau and Excel.
Your table creation script will look like this:
CREATE TABLE twitter_data
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{...
In schema literal, you can define your own schema too.
Write a program: If you are a developer and prefer to wrangle the data programmatically, you have many languages to choose from for reading, parsing, and converting Avro files to JSON.
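For example, a minimal sketch of that approach in Scala (the file name is hypothetical; GenericRecord.toString renders each record as JSON-style text, and DataFileReader handles deflate-compressed blocks out of the box, snappy if the codec jar is present):

import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import scala.collection.JavaConverters._

// Read the container file using the schema embedded in it and print each record as JSON-like text.
val reader = new DataFileReader[GenericRecord](
  new File("tweets.avro"), new GenericDatumReader[GenericRecord]())
try {
  reader.iterator().asScala.foreach(record => println(record.toString))
} finally {
  reader.close()
}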

How do I use Sqoop to save data in a parquet-avro file format?

I need to move my data from a relational database to HDFS, but I would like to save the data in the parquet-avro file format. Looking at the Sqoop documentation, it seems like my options are --as-parquetfile or --as-avrodatafile, but not a mix of both. From my understanding of the blog/picture below, the way parquet-avro works is that it is a Parquet file with the Avro schema embedded, plus a converter to convert and save an Avro object to a Parquet file and vice versa.
My initial assumption is that if I use the Sqoop option --as-parquetfile, then the data saved to the Parquet file will be missing the Avro schema and the converter won't work. However, looking at the Sqoop code that saves the data to the Parquet file format, it does seem to be using a util related to Avro, but I'm not sure what's going on. Could someone clarify? If I cannot do this with Sqoop, what other options do I have?
parquet-avro is mainly a convenience layer so that you can read/write data stored in Apache Parquet as Avro objects. When you read the Parquet data back with parquet-avro, the Avro schema is inferred from the Parquet schema (alternatively, you should be able to specify an explicit Avro schema). Thus you should be fine with --as-parquetfile.
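A sketch of what that convenience layer looks like from Scala, assuming the org.apache.parquet.avro artifact (the path and column name are hypothetical):

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader

// Reads Parquet rows back as Avro GenericRecords; the Avro schema is derived from the
// Parquet schema, so a --as-parquetfile import remains readable through the Avro API.
val reader = AvroParquetReader.builder[GenericRecord](
  new Path("/user/sqoop/mytable/part-m-00000.parquet")).build()
try {
  Iterator.continually(reader.read()).takeWhile(_ != null).foreach { record =>
    println(record.get("id"))   // "id" is a hypothetical column
  }
} finally {
  reader.close()
}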

Storing Avro data in ORC format in HDFS without using Hive

I am comparing storing Avro data in ORC and Parquet formats.
I succeeded in storing Avro data as Parquet using "com.twitter" % "parquet-avro" % "1.6.0", but I am unable to find any information or API to store the Avro data in ORC format.
Is ORC tightly coupled with Hive only?
Thanks
subahsh
You haven't said you're using Spark, but the question is tagged with it, so I assume you are.
The ORC file format is currently heavily tied to the HiveContext in Spark (and I think only available in 1.4 and up), but if you create a Hive context, you should be able to write DataFrames to ORC files the same way you can with Parquet, for example:
import org.apache.spark.sql._
import com.databricks.spark.avro._  // spark-avro: provides the .avro method on DataFrameReader

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = sqlContext.read.avro("/input/path")
df.write.format("orc").save("/path/to/use")
If you're reading the Avro data via the Spark DataFrames API, then that's all you should need, but there are more details on the Hortonworks blog.
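For completeness, a short sketch of reading the ORC output back through the same Hive context (the path matches the one used above):

val orcDf = sqlContext.read.format("orc").load("/path/to/use")
orcDf.printSchema()
orcDf.show(10)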

How to store Avro format in HDFS using PIG?

After processing the input data, I have a Java object. I've created an Avro schema for storing the object in an Avro file. I'm stuck at writing the object to HDFS using that schema. Can anyone walk me through the process of writing the object using a Pig script and the corresponding UDF?
I suppose you are using a UDF, since you are using Java.
So you just have to return the result of your UDF as a Pig Tuple.
Then you get a relation with your data, ready to store.
Finally, you can use the STORE command with AvroStorage.
