How to read two-dimensional complex data with pyspark? - hadoop

1) I have data in the following format, read with np.fromfile(path_file, dtype=np.complex64):
[image: sample of the complex64 data]
2) The file has a .raw suffix; I previously wrote it from Python via open(path, 'wb').
I have put the data into HDFS. How can I read it back as an RDD? I have tried binaryFiles, but the data comes out garbled. How can I read data in the format shown in the figure?
1) sc.binaryFiles: reads out garbled bytes
2) sc.textFile: reads the data out as strings
dataset = np.array([[-1+1j,-1+1j,-1+1j],[-2+2j,-2+2j,-2+2j],[-3+3j,-3+3j,-3+3j],[-4+4j,-4+4j,-4+4j]])
Since the full dataset is too large to show, I approximate it here with a 4×3 matrix created with np.array.
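A minimal PySpark sketch of one way to decode such a file: it assumes the file was written as raw complex64 bytes and that the column count is known (the raw file itself carries no shape information); the HDFS path and n_cols value below are placeholders.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="read-complex-raw")

# Number of columns in the original matrix; this has to be known (or
# stored elsewhere), because the raw file carries no shape information.
n_cols = 3  # assumption, matching the 4x3 example above

# sc.binaryFiles yields (path, bytes) pairs, one per file, so the bytes
# can be decoded with the same dtype that wrote the file.
rdd = (sc.binaryFiles("hdfs:///data/dataset.raw")   # hypothetical path
         .mapValues(lambda b: np.frombuffer(b, dtype=np.complex64)
                                .reshape(-1, n_cols)))

print(rdd.values().first())
```

sc.textFile cannot work here because it splits on newline bytes, which can fall anywhere inside a binary stream; binaryFiles keeps each file intact so the bytes can be reinterpreted with the original dtype.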

Related

Get data and read short string from txt-file in SPSS syntax

I would like to use GET DATA to open my data, then read a string from a text file. The string would be a date (e.g. "2017-09-02 13:24") which I would use to filter the data set before saving it as .sav.
Is this possible? Or is there any other suggestion on how to import external information to use while processing the data set?
With ADD FILE I know it's possible to open two different data sets. However, I have to use GET DATA.
The .sps file is run from an SPSS job file.

Pass parameter from spark to input format

We have files with a specific format in HDFS. We want to process the data extracted from these files within Spark. We have started to write an input format in order to create the RDD. This way we hope to be able to create an RDD from the whole file.
But each processing step only needs a small subset of the data contained in the file, and I know how to extract this subset very efficiently, much more efficiently than filtering a huge RDD.
How can I pass a query filter in the form of a String from my driver to my input format (the same way a Hive context does)?
Edit:
My file format is NetCDF, which stores huge matrices efficiently for multidimensional data, for example x, y, z and time. A first approach would be to extract all values from the matrix and produce an RDD line for each value. I'd like my input format to extract only a small subset of the matrix (maybe 0.01%) and build a small RDD to work with. The subset could be z = 0 and a small time period. I need to pass the time period to the input format, which would retrieve only the values I'm interested in.
I guess the Hive context does this when you pass an SQL query to it. Only values matching the SQL query are present in the RDD, not all lines of the files.
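One common pattern, sketched below rather than taken from your code, is to ship the filter through the Hadoop configuration that PySpark passes to newAPIHadoopFile; the InputFormat can then read it from the job configuration on the executor side. The property names, InputFormat class and path here are hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext(appName="netcdf-subset")

# Anything placed in the configuration passed to newAPIHadoopFile is
# visible to the InputFormat through context.getConfiguration(), so a
# filter can be shipped to it as plain strings.
conf = {
    "netcdf.filter.z": "0",                        # hypothetical property names
    "netcdf.filter.time.start": "2015-01-01T00:00:00",
    "netcdf.filter.time.end": "2015-01-07T00:00:00",
}

rdd = sc.newAPIHadoopFile(
    "hdfs:///data/measurements.nc",                # hypothetical path
    "com.example.NetCDFInputFormat",               # hypothetical custom InputFormat
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf=conf,
)
```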

Hadoop Input Formats - Usage

What are the different file formats in Hadoop? By default Hadoop uses the text input format. What are the advantages/disadvantages of using the text input format?
What are the advantages/disadvantages of Avro over the text input format?
Also please help me understand the use cases for the different file formats (Avro, Sequence, TextInput, RCFile).
I believe there is no advantage to Text as the default other than that its contents are human readable and friendly. You can easily view the contents by issuing hadoop fs -cat.
The disadvantages of the Text format are:
It takes more resources on disk, so it impacts production job efficiency.
Writing/parsing the text records takes more time.
There is no option to maintain data types in case the text is composed of multiple columns.
The Sequence, Avro and RCFile formats have very significant advantages over the Text format.
Sequence - The key/value objects are stored directly in binary format through Hadoop's native serialization process, by implementing the Writable interface. The data types of the columns are well maintained, and parsing the records with the relevant data types is done easily. Obviously it takes less space compared with Text due to the binary format.
Avro - It's a very compact binary storage format for Hadoop key/value pairs; it reads/writes records through Avro serialization/deserialization. It is very similar to the Sequence file format but also provides language interoperability and cell versioning.
You may choose Avro over Sequence only if you need cell versioning or if the stored data will be used by other applications written in languages other than Java. Avro files can be processed in languages like C, Ruby, Python, PHP and Java, whereas Sequence files are specific to Java.
RCFile - The Record Columnar File format is column oriented. It is a Hive-specific storage format designed to let Hive load data faster and reduce storage space.
Apart from these you may also consider the ORC and Parquet file formats.
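For illustration, here is a small PySpark sketch of the SequenceFile round trip described above; the HDFS path is a placeholder.

```python
from pyspark import SparkContext

sc = SparkContext(appName="sequencefile-demo")

# Key/value pairs are serialized through Hadoop Writables, so the types
# survive the round trip instead of being flattened to text.
pairs = sc.parallelize([(1, "alpha"), (2, "beta"), (3, "gamma")])
pairs.saveAsSequenceFile("hdfs:///tmp/demo_seq")     # hypothetical path

restored = sc.sequenceFile("hdfs:///tmp/demo_seq")
print(restored.collect())                            # keys come back as ints
```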

Hive file formats advantages and disadvantages

I am starting to work with Hive.
I wanted to know which kinds of queries suit each table format, among these formats:
RCFile, ORC, Parquet, delimited text
When you have tables with a very large number of columns and you tend to use specific columns frequently, the RC file format is a good choice. Rather than reading the entire row of data you retrieve just the required columns, thus saving time. The data is divided into groups of rows, which are then divided into groups of columns.
Delimited text is the general-purpose file format.
For the ORC file format, have a look at the Hive documentation, which has a detailed description here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
The Parquet file format stores data in columnar form.
eg:
Col1 Col2
A 1
B 2
C 3
With row-oriented storage the data is laid out as A1B2C3; with Parquet it is laid out as ABC123.
For the Parquet file format, have a read of https://blog.twitter.com/2013/dremel-made-simple-with-parquet
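If it helps to see the columnar layout in action, here is a rough PySpark sketch (the paths are placeholders) that writes the small Col1/Col2 example above as Parquet and ORC and reads back a single column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-demo").getOrCreate()

df = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["Col1", "Col2"])

# Columnar formats lay the data out column by column on disk.
df.write.mode("overwrite").parquet("hdfs:///tmp/demo_parquet")  # hypothetical paths
df.write.mode("overwrite").orc("hdfs:///tmp/demo_orc")

# Selecting a single column only has to scan that column's data on disk.
spark.read.parquet("hdfs:///tmp/demo_parquet").select("Col2").show()
```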
I see that there are a couple of answers, but since your question didn't ask about any particular file format, each answer addressed one format or another.
There are a bunch of file formats that you can use in Hive. Notable mentions are Avro, Parquet, RCFile and ORC. There are some good documents available online that you may refer to if you want to compare the performance and space utilization of these file formats. Here are some useful links that will get you going.
This Blog Post
This link from MapR [They don't discuss Parquet though]
This link from Inquidia
The links above will get you going. I hope this answers your query.
Thanks!

Amazon EMR JSON

I am using Amazon EMR Hadoop Hive for big data processing. The current data in my log files is in CSV format. In order to create the table from the log files, I wrote a regex expression to parse the data and store it into the different columns of an external table. I know that a SerDe can be used to read data in JSON format, which means that each log file line could be a JSON object. Are there any Hadoop performance advantages if my log files are in JSON format compared to CSV format?
If you can process the output of the table (that you created with the regexp), why do another round of processing? Try to avoid unnecessary work.
I think the main issue here is which format is faster to read. I believe CSV will provide better speed than JSON, but don't take my word for it. Hadoop really doesn't care; it's all byte arrays to it once in memory.
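If you'd rather measure than guess, a rough PySpark sketch along these lines could compare the two read paths; the paths are hypothetical and assume the same logs stored once as CSV and once as line-delimited JSON.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-vs-json").getOrCreate()

def timed_count(read_fn, path):
    # Force a full scan and report the wall-clock time it took.
    start = time.time()
    n = read_fn(path).count()
    return n, round(time.time() - start, 2)

# Hypothetical paths: the same log data in the two formats.
print("csv :", timed_count(spark.read.csv, "hdfs:///logs/events_csv"))
print("json:", timed_count(spark.read.json, "hdfs:///logs/events_json"))
```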
