Read protocol buffer files in Apache Beam - protocol-buffers

I have a bunch of protobuf files in GCS and I would like to process them through Dataflow (Java SDK), and I am not sure how to do that.
Apache Beam provides AvroIO to read Avro files:
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records =
    p.apply(AvroIO.readGenericRecords(schema)
        .from("gs://my_bucket/path/to/records-*.avro"));
Is there anything similar for reading protobuf files?
Thanks in advance
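One hedged sketch of a common workaround, in case it helps: match the files with FileIO and parse each one with the class protoc generated from your .proto schema. Here MyMessage, the .pb extension, and the one-serialized-message-per-file layout are assumptions, not Beam-provided names.
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// MyMessage is a hypothetical class generated by protoc from your .proto file.
PCollection<MyMessage> records = p
    .apply(FileIO.match().filepattern("gs://my_bucket/path/to/records-*.pb"))
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new DoFn<FileIO.ReadableFile, MyMessage>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        // Assumes each file holds exactly one serialized message; a length-delimited
        // stream would need MyMessage.parseDelimitedFrom(...) in a loop instead.
        c.output(MyMessage.parseFrom(c.element().readFullyAsBytes()));
      }
    }));
// You may also need to set a coder for the output, e.g. ProtoCoder from the
// beam-sdks-java-extensions-protobuf module.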

Related

Best way to automate getting data from CSV files to a data lake

I need to get data from CSV files (daily extractions from different business databases) to HDFS, then move it to HBase, and finally load aggregations of this data into a data mart (SQL Server).
I would like to know the best way to automate this process (using Java or Hadoop tools).
I'd echo the comment above re. Kafka Connect, which is part of Apache Kafka. With it you just use configuration files to stream from your sources; you can use KSQL to create derived/enriched/aggregated streams, and then stream these to HDFS, Elasticsearch, HBase, JDBC, and so on.
There's a list of Kafka Connect connectors here.
This blog series walks through the basics:
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/
https://www.confluent.io/blog/the-simplest-useful-kafka-connect-data-pipeline-in-the-world-or-thereabouts-part-2/
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/
Little to no coding required? In no particular order:
Talend Open Studio
StreamSets Data Collector
Apache NiFi
Assuming you can set up a Kafka cluster, you can try Kafka Connect
If you want to program something, Spark is the likely choice (see the sketch after this list); otherwise, pick your favorite language. Schedule the job via Oozie
If you don't need the raw HDFS data, you can load directly into HBase
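For the Spark option, a minimal sketch of the landing step, assuming Spark 2.x; the paths, header option, and Parquet output are placeholders, and the HBase load and SQL Server aggregation would follow as separate jobs:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToHdfs {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("csv-to-hdfs").getOrCreate();

    // Read the daily CSV extracts (path and options are placeholders).
    Dataset<Row> df = spark.read()
        .option("header", "true")
        .csv("hdfs:///landing/daily/*.csv");

    // Land the raw data in HDFS as Parquet; downstream jobs can load it into
    // HBase and build the aggregates for the SQL Server data mart.
    df.write().mode("overwrite").parquet("hdfs:///raw/daily/");

    spark.stop();
  }
}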

How to convert a Parquet file to an Avro file?

I am new to Hadoop and big data technologies. I would like to convert a Parquet file to an Avro file and read that data. I searched a few forums, and they suggested using AvroParquetReader.
AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();
But I am not sure how to include AvroParquetReader; I am not able to import it at all.
I can read this file using spark-shell and maybe convert it to some JSON, and then that JSON can be converted to Avro. But I am looking for a simpler solution.
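For reference, a minimal self-contained sketch of that snippet with the imports it needs; it assumes the parquet-avro artifact (org.apache.parquet:parquet-avro) and the Hadoop client libraries are on the classpath, and data.parquet is a placeholder path:
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadParquetAsAvro {
  public static void main(String[] args) throws Exception {
    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new Path("data.parquet")).build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);  // each Parquet row comes back as an Avro GenericRecord
      }
    }
  }
}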
If you are able to use Spark DataFrames, you will be able to read the parquet files natively in Apache Spark, e.g. (in Python pseudo-code):
df = spark.read.parquet(...)
To save the files, you can use the spark-avro Spark Package. To write the DataFrame out as Avro, it would be something like:
df.write.format("com.databricks.spark.avro").save("...")
Don't forget that you will need to include the right version of the spark-avro Spark Package for your version of Spark (e.g. 3.1.0-s2.11 corresponds to spark-avro package 3.1 using Scala 2.11, which matches the default Spark 2.0 cluster). For more information on how to use the package, please refer to https://spark-packages.org/package/databricks/spark-avro.
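A rough Java equivalent of the pseudo-code above, under the same assumption that the spark-avro package is available on the cluster (paths are placeholders):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("parquet-to-avro").getOrCreate();

// Read the Parquet file natively, then write it back out as Avro via spark-avro.
Dataset<Row> df = spark.read().parquet("hdfs:///input/data.parquet");
df.write().format("com.databricks.spark.avro").save("hdfs:///output/data-avro");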
Some handy references include:
Spark SQL Programming Guide
spark-avro Spark Package.

File formats supported by Flume

I am using Flume to collect logs.
What other formats does Flume support apart from logs, such as Excel, CSV, Word docs, etc., or is it limited to logs?
Can Flume connect to a relational database?
Regards
Chhaya
If you want to import/export data to/from Hadoop, then Sqoop should be your choice. For Flume, the data is just a set of bytes, so you could essentially support any format.

Flume: Send files to HDFS via APIs

I am new to Apache Flume NG. I want to send files from a client agent to a server agent, which will ultimately write the files to HDFS. I have seen http://cuddletech.com/blog/?p=795 . This is the best one I have found so far, but it works via a script, not via APIs. I want to do it via the Flume APIs. Please help me in this regard, and tell me the steps for how to start and organize the code.
I think you should maybe explain more about what you want to achieve.
The link you posted appears to be just fine for your needs. You need to start a Flume agent on your client to read the files and send them using the Avro sink. Then you need a Flume agent on your server which uses an Avro source to read the events and write them where you want.
If you want to send events directly from an application then have a look at the embedded agent in Flume 1.4 or the Flume appender in log4j2 or (worse) the log4j appender in Flume.
Check this: http://flume.apache.org/FlumeDeveloperGuide.html
You can write a client to send events or use the embedded agent.
As for the code organization, it is up to you.
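For example, a minimal sketch of such a client using Flume's RPC client API; the host, port, and payload are placeholders, and the server-side agent is assumed to expose an Avro source on that port with an HDFS sink behind it:
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeRpcSender {
  public static void main(String[] args) throws EventDeliveryException {
    // Connects to the Avro source of the server-side agent (host/port are placeholders).
    RpcClient client = RpcClientFactory.getDefaultInstance("flume-server.example.com", 41414);
    try {
      Event event = EventBuilder.withBody("hello from the client", StandardCharsets.UTF_8);
      client.append(event);  // delivered to the agent's channel, then on to the HDFS sink
    } finally {
      client.close();
    }
  }
}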

Twitter - Hadoop Data Streaming

How do we get Twitter data (tweets) into HDFS for offline analysis? We have a requirement to analyze tweets.
I would look for a solution in the well-developed area of streaming logs into Hadoop, since the task looks somewhat similar.
There are two existing systems doing so:
Flume: https://github.com/cloudera/flume/wiki
Scribe: https://github.com/facebook/scribe
So your task will only be to pull data from Twitter, which I assume is not part of this question, and feed one of these systems with those logs.
The Fluentd log collector just released its WebHDFS plugin, which allows users to stream data into HDFS instantly.
Fluentd + Hadoop: Instant Big Data Collection
Also, by using fluent-plugin-twitter, you can collect Twitter streams by calling its APIs. Of course, you can create your own custom collector, which posts streams to Fluentd. Here's a Ruby example of posting logs to Fluentd:
Fluentd: Data Import from Ruby Applications
This can be a solution to your problem:
Tools to capture Twitter tweets
Create PDF, DOC, XML and other docs from Twitter tweets
Tweets to CSV files
Capture the tweets in any format (CSV, TXT, DOC, PDF, etc.).
Put them into HDFS (see the sketch below).
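For that HDFS step, a minimal sketch using the Hadoop FileSystem API; the namenode address and file paths are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutTweetsIntoHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");  // placeholder namenode address
    try (FileSystem fs = FileSystem.get(conf)) {
      // Copies a locally captured tweets file into HDFS (paths are placeholders).
      fs.copyFromLocalFile(new Path("/tmp/tweets.csv"), new Path("/data/tweets/tweets.csv"));
    }
  }
}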
