Bigdata Live data streaming using flume - hadoop

I am trying to analyze Twitter data using Flume.
I got the files from Twitter using Flume in BigInsights,
but the data I received is in a compressed Avro format, which is not readable.
Can anyone tell me a way to convert that file to readable JSON
so that I can do some analysis on it?
Or is there a way to receive the data in readable JSON format in the first place?
Thanks in advance.
This is the data I received

The Avro format is not designed to be human readable; it is designed to be consumed by programs. But you have a few options to view this data or, even better, to analyze it.
Create Hive Table: This option lets you analyze the data using SQL queries, Spark SQL, Spark notebooks, and visualization tools like Tableau and Excel.
Your table creation script will look like this:
CREATE TABLE twitter_data
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{...
In the schema literal, you can define your own schema too.
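To illustrate what a complete schema literal could look like, here is a minimal Python sketch that fills the same DDL with a schema of your own. The function name avro_table_ddl and the example Tweet schema are hypothetical, not taken from the original script, and the sketch simply refuses schemas containing single quotes rather than trying to escape them.

```python
import json

# Same DDL shape as the script above; only the schema literal is filled in.
DDL_TEMPLATE = """CREATE TABLE {table}
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{schema}')"""


def avro_table_ddl(table, schema):
    """Render the CREATE TABLE statement for an Avro schema given as a dict."""
    literal = json.dumps(schema)
    if "'" in literal:
        # The literal sits inside a single-quoted Hive string.
        raise ValueError("single quotes would break the quoted Hive literal")
    return DDL_TEMPLATE.format(table=table, schema=literal)


# Hypothetical schema, for illustration only.
example_schema = {
    "type": "record",
    "name": "Tweet",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "text", "type": ["null", "string"]},
    ],
}
print(avro_table_ddl("twitter_data", example_schema))
```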
Write Program: If you are a developer and prefer to wrangle the data in code, there are many languages you can use to read and parse an Avro file and write it back out as JSON.
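As a rough illustration of the "write a program" option, here is a standard-library-only Python sketch that reads an Avro object container file and returns one JSON string per record. It is a sketch under stated assumptions, not a full Avro implementation: it handles only the "null" and "deflate" codecs, does not resolve references to previously named types, and ignores logical types. The function name avro_to_json_lines is made up for this example; for real data a maintained Avro library is likely the better choice.

```python
import io
import json
import struct
import zlib

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(buf):
    """Decode one zigzag varint (used for Avro int and long)."""
    shift, result = 0, 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (result >> 1) ^ -(result & 1)
        shift += 7


def read_bytes(buf):
    """Length-prefixed byte string (used for Avro bytes and string)."""
    return buf.read(read_long(buf))


def read_blocks(buf, read_item):
    """Arrays and maps are written as counted blocks ending with a 0 count."""
    items = []
    count = read_long(buf)
    while count != 0:
        if count < 0:  # a negative count is followed by the block's byte size
            count = -count
            read_long(buf)
        for _ in range(count):
            items.append(read_item())
        count = read_long(buf)
    return items


def read_datum(buf, schema):
    """Decode one value according to its (already parsed) JSON schema."""
    if isinstance(schema, list):  # union: branch index comes first
        return read_datum(buf, schema[read_long(buf)])
    kind = schema["type"] if isinstance(schema, dict) else schema
    if kind == "null":
        return None
    if kind == "boolean":
        return buf.read(1) != b"\x00"
    if kind in ("int", "long"):
        return read_long(buf)
    if kind == "float":
        return struct.unpack("<f", buf.read(4))[0]
    if kind == "double":
        return struct.unpack("<d", buf.read(8))[0]
    if kind == "string":
        return read_bytes(buf).decode("utf-8")
    if kind == "bytes":
        return read_bytes(buf).hex()  # hex text so the value is JSON-friendly
    if kind == "fixed":
        return buf.read(schema["size"]).hex()
    if kind == "enum":
        return schema["symbols"][read_long(buf)]
    if kind == "array":
        return read_blocks(buf, lambda: read_datum(buf, schema["items"]))
    if kind == "map":
        return dict(read_blocks(buf, lambda: (
            read_bytes(buf).decode("utf-8"), read_datum(buf, schema["values"]))))
    if kind == "record":
        return {f["name"]: read_datum(buf, f["type"]) for f in schema["fields"]}
    raise ValueError(f"type not supported by this sketch: {kind!r}")


def avro_to_json_lines(data):
    """Turn the bytes of an Avro container file into a list of JSON strings."""
    buf = io.BytesIO(data)
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    # The header metadata is itself an Avro map of string -> bytes.
    meta = dict(read_blocks(buf, lambda: (
        read_bytes(buf).decode("utf-8"), read_bytes(buf))))
    schema = json.loads(meta["avro.schema"])
    codec = meta.get("avro.codec", b"null")
    sync = buf.read(16)
    lines = []
    while buf.tell() < len(data):
        count = read_long(buf)
        block = buf.read(read_long(buf))
        if codec == b"deflate":
            block = zlib.decompress(block, -15)  # raw deflate, no zlib header
        elif codec != b"null":
            raise ValueError(f"codec not supported by this sketch: {codec!r}")
        block_buf = io.BytesIO(block)
        for _ in range(count):
            lines.append(json.dumps(read_datum(block_buf, schema)))
        if buf.read(16) != sync:
            raise ValueError("sync marker mismatch, file may be corrupt")
    return lines
```

Calling avro_to_json_lines(open(path, "rb").read()) on a file that fits in memory gives a list you can write out one line at a time; the schema travels inside the file header, so no separate schema file is needed.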

Related

Hadoop Beginner - Data Ingestion & Analysis

HDFS stores both structured and unstructured data. Hive and Impala let us write SQL queries against it (Hive converts them to MapReduce jobs). How does the user come to know the schema in which the data is stored, or how are those tables formed from the data stored in HDFS?
If you use the Parquet file format, there are tools for inspecting the file blocks directly. See this for instance. Most of the Hadoop file formats have similar handy tools too, such as https://orc.apache.org/docs/tools.html for ORC files.

Do all three of Presto, Hive and Impala support the Avro data format?

I am clear about the SerDe available in Hive to support the Avro schema for data formats, and I am comfortable using Avro with Hive.
AvroSerDe
For Presto, on the other hand, I have found this issue:
https://github.com/prestodb/presto/issues/5009
I need to choose components for a fast execution cycle. Presto and Impala provide a much shorter execution cycle.
So, can anyone please clarify which engine would be better for which data format?
Primarily, I am looking for Avro support in Presto right now.
However, let's consider the following data formats stored on HDFS:
Avro format
Parquet format
Orc format
Which engine gives the best performance on each of these data formats?
Please suggest.
Impala can read Avro data but cannot write it. Please refer to this documentation page describing the file formats supported by Impala.
Hive supports both reading and writing Avro files.
Presto's Hive Connector supports Avro as well. Thanks to David Phillips for pointing out this documentation page.
There are various benchmarks on the internet about performance, but I would rather not link to a specific one, as the results depend heavily on the exact use case benchmarked.

How do I use Sqoop to save data in a parquet-avro file format?

I need to move my data from a relational database to HDFS, but I would like to save the data in a parquet-avro file format. Looking at the Sqoop documentation, it seems my options are --as-parquetfile or --as-avrodatafile, but not a mix of both. From my understanding of this blog/picture below, parquet-avro works as a Parquet file with the Avro schema embedded, plus a converter that converts and saves an Avro object to a Parquet file and vice versa.
My initial assumption is that if I use the Sqoop option --as-parquetfile, the data saved to the Parquet file will be missing the Avro schema and the converter won't work. However, looking at the Sqoop code that saves the data in Parquet format, it does seem to use a util related to Avro, but I'm not sure what's going on. Could someone clarify? If I cannot do this with Sqoop, what other options do I have?
parquet-avro is mainly a convenience layer that lets you read and write data stored in Apache Parquet as Avro objects. When you read the Parquet file again with parquet-avro, the Avro schema is inferred from the Parquet schema (alternatively, you should be able to specify an explicit Avro schema). Thus you should be fine with --as-parquetfile.

Is there a way to access Avro data stored in HBase using Hive to do analysis?

My HBase table has rows that contain both serialized Avro (put there using havrobase) and string data. I know that a Hive table can be mapped to Avro data stored in HDFS for data analysis, but I was wondering if anyone has tried to map Hive to HBase table(s) that contain Avro data. Basically, I need to be able to query both Avro and non-Avro data stored in HBase, do some analysis, and store the result in a different HBase table. I need to be able to do this as a batch job as well. I don't want to write a Java MapReduce job for this because our configurations change constantly and we need a scripted approach. Any suggestions? Thanks in advance!
You can write an HBase coprocessor to expose the Avro record as regular HBase qualifiers. You can see an implementation of that in Intel's panthera-dot.

How does Hive store data and what is a SerDe?

When querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When performing an INSERT or CTAS (see “Importing Data” on page 441), the table’s SerDe will serialize Hive’s internal representation of a row of data into the bytes that are written to the output file.
Is a SerDe a library?
How does Hive store data, i.e. does it store it in files or in tables?
Can anyone please explain the bold sentences clearly?
I'm new to Hive!
Answers
Yes, a SerDe is a library. Hive ships with several built-in SerDes (they live under the org.apache.hadoop.hive.serde2 package).
Hive uses file systems like HDFS or other storage (FTP) to store data; the data is presented in the form of tables (which have rows and columns).
SerDe (Serializer/Deserializer) instructs Hive on how to process a record (row). Hive also enables semi-structured (XML, email, etc.) or unstructured records (audio, video, etc.) to be processed. For example, if you have 1000 GB worth of RSS feeds (RSS XMLs), you can ingest those into a location in HDFS. You would then need to write a custom SerDe based on your XML structure so that Hive knows how to load the XML files into Hive tables, or the other way around.
For more information on how to write a SerDe, read this post.
In this respect we can see Hive as a kind of database engine. This engine works on tables, which are built from records.
When we let Hive (like any other database) work in its own internal formats, we do not need to care how the data is stored.
When we want Hive to process our own files as tables (external tables), we have to tell it how to translate the data in the files into records. This is exactly the role of a SerDe. You can see it as a plug-in that enables Hive to read and write your data.
For example, say you want to work with CSV. Here is an example of a CSV SerDe:
https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java
The serialize method will read the data and chop it into fields, assuming it is CSV.
The deserialize method will take a record and format it as CSV.
Hive can analyse semi-structured and unstructured data as well, by using
(1) complex data types (struct, array, union)
(2) a SerDe
The SerDe interface allows us to instruct Hive on how a record should be processed. The Serializer takes a Java object that Hive has been working on and converts it into something that Hive can store, and the Deserializer takes the binary representation of a record and translates it into a Java object that Hive can manipulate.
I think the description of the CSV SerDe methods above has the concepts of serialise and deserialise back to front. Serialising is done on write: the structured data is serialised into a bit/byte stream for storage. On read, the data is deserialised from the bit/byte storage format into the structure required by the reader. E.g. Hive needs structures that look like rows and columns, but HDFS stores the data in bit/byte blocks, so serialise on write, deserialise on read.
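To make the direction concrete (serialise on write, deserialise on read), here is a small Python sketch of a toy CSV SerDe. It only mirrors the idea; the class name ToyCsvSerDe and its methods are hypothetical and are not Hive's actual Java SerDe interface.

```python
import csv
import io


class ToyCsvSerDe:
    """Toy stand-in for a Hive SerDe; illustrates the direction of each call."""

    def __init__(self, columns):
        self.columns = columns  # column names of the table, in order

    def serialize(self, row):
        """Write path: in-memory row (dict) -> bytes for the storage file."""
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow([row[c] for c in self.columns])
        return out.getvalue().encode("utf-8")

    def deserialize(self, raw):
        """Read path: stored bytes -> in-memory row (dict of column -> text)."""
        fields = next(csv.reader(io.StringIO(raw.decode("utf-8"))))
        return dict(zip(self.columns, fields))


serde = ToyCsvSerDe(["id", "text"])
stored = serde.serialize({"id": 1, "text": "hello, world"})  # on INSERT
row = serde.deserialize(stored)                              # on SELECT
print(stored, row)
```

Note that the round trip loses type information (the id comes back as text), which is one reason a real SerDe also carries column types alongside the column names.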

Resources