How to ingest Parquet files residing on AWS S3 into Druid

I'm very new to Druid and want to know how we can ingest Parquet files stored on AWS S3 into Druid.
We receive data in CSV format and standardise it to Parquet in our data lake. That Parquet data then needs to be loaded into Druid.

Instead of trying to ingest the Parquet files from S3, I streamed the data to a Kinesis stream and used that as the source for Druid.
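
If you go that route, Druid consumes from Kinesis through a supervisor: load the druid-kinesis-indexing-service extension and submit a supervisor spec to the Overlord. Below is only a rough sketch in Python; the stream name, endpoint, datasource, timestamp column and Overlord URL are all hypothetical placeholders, and the exact spec layout varies a little between Druid versions.

    import requests

    # Hypothetical Kinesis supervisor spec: datasource, stream, endpoint and
    # timestamp column are placeholders, not values from the question.
    supervisor_spec = {
        "type": "kinesis",
        "spec": {
            "dataSchema": {
                "dataSource": "events",
                "timestampSpec": {"column": "event_time", "format": "iso"},
                "dimensionsSpec": {"dimensions": []},  # empty list lets Druid discover string dimensions
                "granularitySpec": {"segmentGranularity": "HOUR", "queryGranularity": "NONE"},
            },
            "ioConfig": {
                "stream": "my-kinesis-stream",
                "inputFormat": {"type": "json"},
                "endpoint": "kinesis.us-east-1.amazonaws.com",
                "useEarliestSequenceNumber": True,
            },
            "tuningConfig": {"type": "kinesis"},
        },
    }

    # Submit the spec to the Overlord; Druid then starts and manages the ingestion tasks.
    resp = requests.post("http://overlord:8090/druid/indexer/v1/supervisor", json=supervisor_spec)
    resp.raise_for_status()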

You have to add druid-parquet-extensions to druid.extensions.loadList in the common.runtime.properties file, then restart the Druid server.
However, only ingesting a Parquet file from a local source is documented. I couldn't verify loading from S3 because my files were encrypted.
Try adding the extension above and then read from S3 just like you'd ingest a regular file from S3.
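
For what it's worth, recent Druid versions document an S3 input source for native batch ingestion, so a parallel index task with the Parquet input format should work once druid-parquet-extensions (and, for S3 access, druid-s3-extensions) are loaded. A rough sketch in Python, where the bucket, object key, datasource, timestamp column and Router URL are all hypothetical placeholders:

    import requests

    # Hypothetical native batch ingestion spec that reads a Parquet object from S3.
    ingestion_spec = {
        "type": "index_parallel",
        "spec": {
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "s3",                                    # needs druid-s3-extensions
                    "uris": ["s3://my-bucket/standardised/data.parquet"],
                },
                "inputFormat": {"type": "parquet"},                  # needs druid-parquet-extensions
            },
            "dataSchema": {
                "dataSource": "my_datasource",
                "timestampSpec": {"column": "event_time", "format": "auto"},
                "dimensionsSpec": {"dimensions": []},                # let Druid discover dimensions
                "granularitySpec": {"segmentGranularity": "DAY", "queryGranularity": "NONE"},
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }

    # Submit the task through the Router (or directly to the Overlord on :8090).
    resp = requests.post("http://router:8888/druid/indexer/v1/task", json=ingestion_spec)
    resp.raise_for_status()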

Related

Ingesting Parquet files into a landing zone

We are working on Azure and have pipelines that ingest daily data from SAP into Azure Data Lake Gen2. We have been using Azure Data Factory to ingest JSON and CSV files, but maybe it is better to change our approach: ingest Parquet files into the landing zone and then move them to the trusted zone using Delta tables.
What do you think? Is it a good idea to ingest raw data into the landing zone in Parquet format?

Loading data from S3 to Elasticsearch using AWS Glue

I have multiple folders in an S3 bucket, and each folder contains one JSON Lines file.
I want to do two things with this data:
1. Apply some transformations to get tabular data and save it to some database.
2. Save these JSON objects as they are to an Elasticsearch cluster for full-text search.
I am using AWS Glue for this task and I know how to do 1, but I can't find any resources that talk about getting data from S3 and storing it in Elasticsearch using AWS Glue.
Is there a way to do this?
If anyone is looking for an answer to this: I ended up using Logstash to load the files into Elasticsearch.
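
If you'd rather stay inside a Glue Python job instead of running Logstash, one alternative is to read the JSON Lines objects with boto3 and bulk-index them with the Elasticsearch Python client. This is only a sketch; the bucket, prefix, index name and Elasticsearch endpoint are hypothetical placeholders.

    import json
    import boto3
    from elasticsearch import Elasticsearch, helpers

    s3 = boto3.client("s3")
    es = Elasticsearch(["https://my-es-host:9200"])

    def docs(bucket, prefix):
        # Walk every object under the prefix (one JSON Lines file per folder).
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
                for line in body.iter_lines():
                    if line:
                        # Index each JSON object as-is for full-text search.
                        yield {"_index": "my-index", "_source": json.loads(line)}

    # Stream the documents into Elasticsearch in batches.
    helpers.bulk(es, docs("my-bucket", "landing/"))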

How to connect InfluxDB to Hadoop to store InfluxDB data in HDFS?

Time-series data is stored in InfluxDB, but how can InfluxDB be connected to Hadoop so that all of the InfluxDB data ends up in HDFS?
How can the data in InfluxDB be transferred to Hadoop for storage?
InfluxDB doesn't have any customized tools for HDFS, but there are a few ways to export your data into a text file so that you can then insert the data into HDFS.
The influx_inspect export CLI tool will create a text file in line protocol format. If you need another format (CSV, JSON, and column output are available), you can query the data with the influx CLI and adjust its -format flag. Getting that data into HDFS would then (hopefully) be much easier for you.
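
As a concrete sketch of that two-step path (export to a text file, then load it into HDFS), here is a small Python wrapper; the InfluxDB directories, database name and HDFS target are hypothetical placeholders.

    import subprocess

    # Export an InfluxDB database to a line-protocol text file.
    subprocess.run(
        [
            "influx_inspect", "export",
            "-datadir", "/var/lib/influxdb/data",   # InfluxDB data directory
            "-waldir", "/var/lib/influxdb/wal",     # InfluxDB WAL directory
            "-database", "mydb",                    # database to export
            "-out", "/tmp/mydb_export.lp",          # output file in line protocol
        ],
        check=True,
    )

    # Copy the exported file into HDFS.
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", "/tmp/mydb_export.lp", "/data/influxdb/"],
        check=True,
    )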

How to upload large files from HDFS to S3

I have an issue while uploading a large file (larger than 5 GB) from HDFS to S3. Is there a way to upload the file directly from HDFS to S3, using multipart upload, without downloading it to the local file system?
For copying data between HDFS and S3, you should use s3DistCp. s3DistCp is optimized for AWS and does an efficient copy of large numbers of files in parallel into and across S3 buckets.
For usage of s3DistCp, you can refer the document here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
The code for s3DistCp is available here: https://github.com/libin/s3distcp
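
On an EMR cluster this is essentially a one-liner; a sketch driven from Python, with the HDFS path and bucket as hypothetical placeholders (S3DistCp handles the multipart upload and parallelism itself):

    import subprocess

    # Run S3DistCp from the EMR master node to copy an HDFS directory to S3.
    subprocess.run(
        [
            "s3-dist-cp",
            "--src", "hdfs:///user/hadoop/output/",   # source directory in HDFS
            "--dest", "s3://my-bucket/output/",       # destination prefix in S3
        ],
        check=True,
    )
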
If you are using Hadoop 2.7.1 or later, use the s3a:// filesystem to talk to S3. It supports multi-part uploads, which is what you need here.
Update: September 2016
I should add that we are reworking the S3A output stream for Hadoop 2.8; the current one buffers multipart uploads in the heap and falls over when you are generating bulk data faster than your network can push it to S3.
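
A sketch of the s3a approach, driving plain DistCp over the s3a connector; the credentials, part size and paths are hypothetical placeholders, and in practice the fs.s3a.* settings usually live in core-site.xml rather than on the command line.

    import subprocess

    # Copy from HDFS to S3 via the s3a connector (Hadoop 2.7.1+), which does
    # multipart uploads under the hood.
    subprocess.run(
        [
            "hadoop", "distcp",
            "-Dfs.s3a.access.key=AKIA...",          # or use an instance profile / credential provider
            "-Dfs.s3a.secret.key=...",
            "-Dfs.s3a.multipart.size=104857600",    # 100 MB parts for the multipart upload
            "hdfs:///user/hadoop/bigfile",          # source in HDFS
            "s3a://my-bucket/bigfile",              # destination via s3a
        ],
        check=True,
    )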

Copying HDFS-format files from S3 to local

We are using Amazon EMR and Common Crawl to perform crawling. EMR writes the output to Amazon S3 in a binary-like format. We'd like to copy that to our local machines in raw-text format.
How can we achieve that? What's the best way?
Normally we could use hadoop fs -copyToLocal, but we can't access Hadoop directly and the data is on S3.
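
If the binary-like output is Hadoop SequenceFiles and you can install a Hadoop client locally with the S3A connector on its classpath, hadoop fs -text can decode the objects straight out of S3 into plain text. A sketch; the bucket, prefix and credentials are hypothetical placeholders.

    import subprocess

    # Decode SequenceFile output sitting in S3 into a local plain-text file.
    with open("output.txt", "wb") as out:
        subprocess.run(
            [
                "hadoop", "fs",
                "-Dfs.s3a.access.key=AKIA...",
                "-Dfs.s3a.secret.key=...",
                "-text", "s3a://my-bucket/crawl-output/part-*",  # -text decodes SequenceFiles
            ],
            stdout=out,   # stream the decoded text into the local file
            check=True,
        )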
