Parquet raw file content seems incomplete vs content displayed in databricks - parquet

I'm doing a string search for certain values that I would expect to be in a parquet file. Some are found, others are not found.
Yet when I view the parquet file content from a databricks notebook, I can find a missing value within the data shown.
Approx 70% of the data I search for in the raw parquet file (on Windows) is found, but spot checks show that some of the remaining data can be found via the notebook.
Why is some data present when viewing the raw parquet content while other data isn't? It's also consistently the same data that is or isn't found each time.
Here's a screenshot example. It shows a case where a value from the input json IS found in the raw parquet, and a case where a value from the input json is not found in the raw parquet (but IS found in a csv export of the same data from databricks).
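For reference, both checks can be reproduced locally in Python; the file path and search value below are placeholders, and pyarrow is assumed to be available. A raw byte search only matches values that happen to be stored as plain text in the file, whereas reading it through a Parquet reader decodes the column data (which may be dictionary-encoded or compressed) first.

# Minimal sketch: compare a raw byte search against a search over decoded data.
# Path and value are placeholders.
import pyarrow.parquet as pq

path = "data.parquet"            # hypothetical file
needle = "some-expected-value"   # hypothetical value from the input json

# 1) Naive raw search, equivalent to a plain string search on Windows.
with open(path, "rb") as f:
    raw_hit = needle.encode("utf-8") in f.read()

# 2) Search after decoding the file with a Parquet reader,
#    which is effectively what the Databricks notebook displays.
table = pq.read_table(path)
decoded_hit = any(
    needle in str(value)
    for column in table.columns
    for value in column.to_pylist()
)

print(f"raw bytes: {raw_hit}, decoded data: {decoded_hit}")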

Related

ADF force format stored in parquet from copy activity

I've created an ADF pipeline that converts a delimited file to parquet in our data lake. I've added an additional column and set its value using the following expression: #convertfromutc(utcnow(),'GMT Standard Time','o'). The problem I am having is that when I look at the parquet file, the date is coming back in the US format.
eg 11/25/2021 14:25:49
Even if I use #if(pipeline().parameters.LoadDate,json(concat('[{"name": "LoadDate" , "value": "',formatDateTime(convertfromutc(utcnow(),'GMT Standard Time','o')),'"}]')),NULL) to try to force the format on the extra column it still comes back in the parquet in the US format.
Any idea why this would be and how I can get this to output into parquet as a proper timestamp?
Specify the format pattern when using the convertFromUtc function, as shown below.
#convertFromUtc(utcnow(),'GMT Standard Time','yyyy-MM-dd HH:mm:ss')
Added a date1 column under Additional columns on the source to get the required date format.
Preview of the source data in mappings: the data is shown in the format given in the convertFromUtc function.
Output parquet file:
Data preview of the sink parquet file after copying the data from the source.
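As a quick way to double-check the result outside of ADF, the sink file can be read back with pandas; the path below is a placeholder, and this assumes the extra column was named date1 as above.

# Sketch: read the sink parquet back and inspect how the added column was written.
# The file path is a placeholder.
import pandas as pd

df = pd.read_parquet("sink_output.parquet")  # hypothetical path
print(df["date1"].head())   # expect 'yyyy-MM-dd HH:mm:ss' values
print(df["date1"].dtype)    # shows whether it landed as a string or a timestamp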

Can parquet, avro and other hadoop file formats have different layout for first line?

Why do I have to convert an RDD to a DF in order to write it as parquet, avro or other types? I know writing an RDD in these formats is not supported. I was actually trying to write a parquet file with the first line containing only the header date and the other lines containing the detail records. A sample file layout:
2019-04-06
101,peter,20000
102,robin,25000
I want to create a parquet file with the above contents. I already have a csv file sample.csv with the above contents. When the csv file is read as a dataframe, it contains only the first field, since the first row has only one column.
rdd = sc.textFile('hdfs://somepath/sample.csv')
df = rdd.toDF()
df.show()
o/p:
2019-04-06
101
102
Could someone please help me with converting the entire contents of the rdd into a dataframe? Even when I try reading the file directly as a df instead of converting from an rdd, the same thing happens.
Your file only has "one column" in Spark's reader, so the dataframe output will only be that.
You didn't necessarily do anything wrong, but your input file is malformed if you expect there to be more than one column. If so, you should be using spark.csv() instead of sc.textFile().
Why do I have to convert an RDD to DF in order to write it as parquet, avro or other types?
Because those formats need a schema, which an RDD does not have.
trying to write a parquet file with first line containing only the header date and other lines containing the detail records
CSV file headers need to describe all columns. There cannot be an isolated header above all rows.
Parquet/Avro/ORC/JSON do not have column headers like CSV, but the same constraint applies.
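As a rough illustration of the point above, one way to get a usable dataframe out of a file laid out like the sample is to pull the header date off separately and attach it as an ordinary column, giving the detail rows explicit column names. This is only a sketch against the sample shown earlier; the column names are made up.

# Sketch: treat the first line as a value rather than a header, and give the
# detail rows explicit column names before writing parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.sparkContext.textFile('hdfs://somepath/sample.csv')
header_date = raw.first()                               # '2019-04-06'
details = raw.filter(lambda line: line != header_date)  # the detail records

df = (
    details.map(lambda line: tuple(line.split(',')))
    .toDF(['id', 'name', 'salary'])                     # made-up column names
    .withColumn('file_date', F.lit(header_date))
)

df.write.parquet('hdfs://somepath/sample_parquet')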

Page level skip/read in apache parquet

Question: Does Parquet have the ability to skip/read certain pages in a column chunk based on the query we run?
Can page header metadata help here?
http://parquet.apache.org/documentation/latest/
Under File Format, I read this statement, which made me doubtful:
Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially.
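For what it's worth, the footer metadata that statement refers to can be inspected with pyarrow at the row-group/column-chunk level; page headers themselves are not exposed this way. The path below is a placeholder, and this is only an inspection sketch, not a claim about what any particular reader does with page headers.

# Sketch: inspect the metadata a reader sees before touching any pages.
# The file path is a placeholder.
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
meta = pf.metadata
print(meta)                             # row groups, row counts, schema

col = meta.row_group(0).column(0)       # first column chunk of the first row group
print(col.statistics)                   # min/max/null count used for chunk-level skipping
print(col.data_page_offset, col.total_compressed_size)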

Analyzing huge amount of JSON files on S3

I have a huge amount of json files, >100TB in size in total; each json file is 10GB bzipped, each line contains a json object, and they are stored on s3.
If I want to transform the json into csv (also stored on s3) so I can import them into redshift directly, is writing custom code using hadoop the only choice?
Would it be possible to do ad hoc queries on the json files without transforming the data into another format? Since the source is growing, I don't want to convert it into another format every time I need to run a query.
The quickest and easiest way would be to launch an EMR cluster loaded with Hive to do the heavy lifting for this. By using the JsonSerde, you can easily transform the data into csv format. This would only require you to insert the data into a CSV-formatted table from the JSON-formatted table.
A good tutorial for handling the JsonSerde can be found here:
http://aws.amazon.com/articles/2855
Also a good library used for CSV format is:
https://github.com/ogrodnek/csv-serde
The EMR cluster can be short-lived and only needed for that one job, and it can also run on low-cost spot instances.
Once you have the CSV format, the Redshift COPY documentation should suffice.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
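If a Hive cluster isn't already set up, the same JSON-to-CSV step can also be sketched in Spark instead of the JsonSerde route described above; bucket names and paths below are placeholders, and this assumes the input is newline-delimited JSON (the bzip2 files are decompressed transparently).

# Sketch: flatten newline-delimited JSON on S3 into CSV on S3 so Redshift can COPY it.
# Bucket names and paths are placeholders; nested fields would need flattening first.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

json_df = spark.read.json("s3://source-bucket/logs/*.json.bz2")

(
    json_df
    .write
    .option("compression", "gzip")
    .mode("overwrite")
    .csv("s3://target-bucket/logs_csv/")
)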

Amazon EMR JSON

I am using Amazon EMR Hadoop Hive for big data processing. The current data in my log files is in CSV format. In order to make the table from the log files, I wrote a regex expression to parse the data and store it into the different columns of an external table. I know that a SerDe can be used to read data in JSON format, which means each log file line could be a JSON object. Are there any Hadoop performance advantages if my log files are in JSON format compared to CSV format?
If you can process the output of the table (that you created with the regexp), why do more processing? Try to avoid unnecessary work.
I think the main issue here is which format is faster to read. I believe CSV will provide better speed than JSON, but don't take my word for it. Hadoop really doesn't care; it's all byte arrays to it once in memory.
