Spark Streaming from parquet files - spark-streaming

I wanted to know if Parquet files can be fed to Spark Streaming exactly like a file stream or directory stream. So any time a new Parquet file is generated, Spark should pick up its rows?
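
If it helps: the classic DStream fileStream API does not understand Parquet out of the box, but Spark Structured Streaming's file source does exactly this: it monitors a directory and emits the rows of each new Parquet file that lands there. A minimal sketch (the path and schema are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().appName("parquet-stream").getOrCreate()

// Streaming file sources require an explicit schema up front.
val schema = new StructType()
  .add("id", LongType)
  .add("value", StringType)

// Every new Parquet file that appears under the directory becomes
// a new micro-batch of rows.
val parquetStream = spark.readStream
  .schema(schema)
  .parquet("hdfs:///data/incoming/")

parquetStream.writeStream
  .format("console")
  .start()
  .awaitTermination()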

Related

When importing unstructured data - audio and video files - into HDFS, are those files split into blocks?

When importing unstructured data - audio and video files - into HDFS, are those files split into 128 MB blocks and saved the way other data is?
Blocks are created, yes. It is then up to the client to determine how to reassemble the files into proper blobs, which is why object/blob storage, such as Apache Ozone, would be recommended for these file types rather than HDFS block storage.
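
If you want to see the block layout for yourself, here is a small sketch using Hadoop's FileSystem API (the path is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.getFileStatus(new Path("/media/video.mp4"))

// One entry per block: its offset into the file, its length
// (128 MB by default, except the last block), and the datanodes holding it.
fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
  println(s"offset=${block.getOffset} length=${block.getLength} " +
    s"hosts=${block.getHosts.mkString(",")}")
}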

How to read parquet file from s3 bucket in nifi?

I am trying to read Parquet files from an S3 bucket in NiFi.
To read the files I have used the ListS3 and FetchS3Object processors and then the ExtractAttribute processor. Up to there it looked fine.
The files are parquet.gz files, and by no means was I able to generate FlowFiles from them. My final purpose is to load the files into NoSQL (Snowflake).
FetchParquet works with HDFS, which we are not using.
My next option is to use the ExecuteScript processor (with Python) to read these Parquet files and save them back as text.
Can somebody please suggest a workaround?
It depends on what you need to do with the Parquet files.
For example, if you wanted to get them to your local disk, then ListS3 -> FetchS3Object -> PutFile would work fine. This scenario is just moving bytes around, so it doesn't really matter whether the content is Parquet or not.
If you need to actually interpret the Parquet data in some way, which it sounds like you do in order to get it into a database, then you need to use FetchParquet, convert from Parquet to some other format like Avro, JSON, or CSV, and then send that to one of the database processors.
You can use the Fetch/Put Parquet processors, or any other HDFS processors, with S3 by configuring a core-site.xml with an S3 filesystem.
http://apache-nifi-users-list.2361937.n4.nabble.com/PutParquet-with-S3-td3632.html
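
To expand on that last suggestion, a minimal core-site.xml sketch that points the Parquet/HDFS processors at S3 via the s3a filesystem. The bucket and credentials are placeholders, and the hadoop-aws and AWS SDK jars must be on the processor's additional classpath:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-bucket</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>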

How can Flink read 4mc data from HDFS?

There is 4mc data on HDFS. When I use Flink
env.readTextFile("hdfs://127.0.0.1:8020/search_logs/4mc")
it cannot detect the compression format. Does Flink have a related API for handling this kind of compressed data?
Thanks.
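
Flink does not ship a 4mc format itself, but one possible direction is Flink's Hadoop compatibility layer, wrapping the input format that ships with the 4mc library. A hedged sketch only; verify the FourMcTextInputFormat class and package name against the 4mc version you have deployed:

import org.apache.flink.api.scala._
import org.apache.flink.api.scala.hadoop.mapreduce.HadoopInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
// This class ships with the 4mc library; the package name below is an
// assumption, so check it against your 4mc jar.
import com.fing.mapreduce.FourMcTextInputFormat

val env = ExecutionEnvironment.getExecutionEnvironment
val job = Job.getInstance()

val input = new HadoopInputFormat[LongWritable, Text](
  new FourMcTextInputFormat, classOf[LongWritable], classOf[Text], job)
FileInputFormat.addInputPath(job, new Path("hdfs://127.0.0.1:8020/search_logs/4mc"))

// Each record is (byte offset, line); the 4mc input format handles
// decompression and keeps the files splittable.
val lines = env.createInput(input).map(_._2.toString)
lines.first(10).print()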

Can I pull data directly from a Hive table to H2O?

We have our data stored in Hive text files and Parquet files. Is there any way to load directly from these into H2O, or do we have to go through an intermediate step like CSV or a pandas DataFrame?
Yes, you can find all the information you need in the H2O documentation.
H2O currently supports the following file types:
CSV (delimited) files (including GZipped CSV)
ORC
SVMLight
ARFF
XLS
XLSX
Avro version 1.8.0 (without multifile parsing or column type modification)
Parquet
Notes:
ORC is available only if H2O is running as a Hadoop job.
Users can also import Hive files that are saved in ORC format.
When doing a parallel data import into a cluster:
If the data is an unzipped CSV file, H2O can do offset reads, so each node in your cluster can directly read its part of the CSV file in parallel.
If the data is zipped, H2O will have to read the whole file and unzip it before doing the parallel read.
So if you are reading very large data files from HDFS, it is best to use unzipped CSV. But if the data is farther away than the LAN, then it is best to use zipped CSV.
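
If the data is already registered as Hive tables, another route is Sparkling Water: read the table with Spark and hand the DataFrame to H2O directly, skipping any CSV detour. A rough sketch (the table name is hypothetical, and the H2OContext package and getOrCreate signature vary across Sparkling Water versions):

import org.apache.spark.sql.SparkSession
// In newer Sparkling Water releases this lives at ai.h2o.sparkling.H2OContext.
import org.apache.spark.h2o.H2OContext

val spark = SparkSession.builder()
  .appName("hive-to-h2o")
  .enableHiveSupport()
  .getOrCreate()

val h2oContext = H2OContext.getOrCreate(spark)

// Spark reads the Hive table (text- or Parquet-backed) ...
val df = spark.sql("SELECT * FROM my_db.my_table")

// ... and Sparkling Water converts it into an H2OFrame.
val frame = h2oContext.asH2OFrame(df)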

How to read a tar.bzip2 file in Hadoop or Spark

How can I read a tar.bzip2 file in Spark in parallel?
I have created a custom Java Hadoop reader that reads the tar.bzip2 file, but it is taking too much time since only one core is used, and after some time the application fails because a single executor gets all the data.
As we know, bzipped files are splittable, so when reading a bzipped file into an RDD the data gets distributed across the partitions. However, the underlying tar file also gets distributed across the partitions, and tar is not splittable, so if you try to perform an operation on a partition you will just see a lot of binary data.
To solve this I simply read the bzipped data into an RDD with a single partition. I then wrote this RDD out to a directory, so now you have only a single file containing all the tar file data. Finally I pulled this tar file from HDFS down to my local file system and untarred it.
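
A rough sketch of that workaround, assuming an existing SparkContext sc and hypothetical paths. Be aware that a line-oriented read can alter binary bytes, so validate the extracted tar afterwards:

// Spark picks the bzip2 codec from the .bz2 extension and decompresses
// transparently; coalescing to one partition keeps the tar stream in order.
val raw = sc.textFile("hdfs:///data/archive.tar.bz2")
raw.coalesce(1).saveAsTextFile("hdfs:///data/archive-tar")

// Then, outside Spark:
//   hdfs dfs -get /data/archive-tar/part-00000 archive.tar
//   tar -xf archive.tar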
