PyArrow get Metadata from file in S3 - parquet

I want to get Parquet file statistics (such as min/max) from a file in S3 using PyArrow.
I am able to fetch the dataset using
pq.ParquetDataset(s3_path, filesystem=s3)
and I can get the statistics if I download the file and read it using:
ParquetFile(full_path).metadata.row_group(0).column(col_idx).statistics
I hope there is a way to achieve this without downloading the whole file.
Thanks

I came to this post looking for a similar answer a few days ago. In the end I found a simple solution that works for me.
import pyarrow.parquet as pq
from pyarrow import fs
s3_files = fs.S3FileSystem(access_key)  # whatever is needed to connect to S3
# fetch the dataset
dataset = pq.ParquetDataset(s3_path, filesystem=s3_files)
metadata = {}
for fragment in dataset.fragments:
    meta = fragment.metadata
    metadata[fragment.path] = meta
    print(meta)
The metadata is stored in a dictionary whose keys are the paths of the fragments in S3 and whose values are the metadata of each particular fragment.
To access the statistics, just use
meta.row_group(0).column(col_idx).statistics
Something like this will be printed for every fragment:
<pyarrow._parquet.FileMetaData object at 0x7fb5a045b5e0>
created_by: parquet-cpp-arrow version 8.0.0
num_columns: 6
num_rows: 10
num_row_groups: 1
format_version: 1.0
serialized_size: 3673
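If you only need the min/max values, you can extend the loop above to walk every row group of every fragment and collect the statistics into a dictionary keyed by column path. This is a minimal sketch under the same setup as above; the dictionary layout is just one possible choice:
column_stats = {}
for fragment in dataset.fragments:
    meta = fragment.metadata
    for rg_idx in range(meta.num_row_groups):
        row_group = meta.row_group(rg_idx)
        for col_idx in range(row_group.num_columns):
            col = row_group.column(col_idx)
            stats = col.statistics
            # statistics can be missing for a column chunk, so guard before reading min/max
            if stats is not None and stats.has_min_max:
                column_stats.setdefault(col.path_in_schema, []).append((stats.min, stats.max))
Only the Parquet footers need to be read for this, so the data pages themselves are not downloaded.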

Related

Failed to run spark query in databricks notebook after storage configurations

I already set up key vault scope in the notebooks and I established the connection to the storage account using the following steps:
spark.conf.set("fs.azure.account.auth.type."+StorageAccountName+".dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type."+StorageAccountName+".dfs.core.windows.net","org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id."+StorageAccountName+".dfs.core.windows.net",clientId)
spark.conf.set("fs.azure.account.oauth2.client.secret."+StorageAccountName+".dfs.core.windows.net",clientSecret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint."+StorageAccountName+".dfs.core.windows.net","https://login.microsoftonline.com/mytenantid/oauth2/token")
The values of "StorageAccountName", "clientId", "clientSecret" all come from key vault and I am able to get their value properly. In my storage account access control I also assigned the
Storage Blob Data Contributor role to my service principal.
After these configurations, I assigned a connection variable:
var apptable = "abfss://container#"+StorageAccountName+".dfs.core.windows.net/path/to/data"
If I run the following command, I am able to see the files in the blob storage
display(dbutils.fs.ls(apptable))
I am also able to check the schema:
var df = spark.read.format("delta").load(apptable)
df.printSchema()
but if I tried to run the following query:
var last_appt = spark.sql(s"""select max(updateddate) from apptable""").collect()(0).getTimestamp(0)
I got the error:
KeyProviderException: Failure to initialize configuration
Caused by: InvalidConfigurationValueException: Invalid configuration value detected for fs.azure.account.key
I researched online and it seems there are some issues in the Spark configs. But if it failed to get access to the storage, how come the above display command runs fine? What could possibly be missing in such a scenario?
I have limited experience with Databricks. I appreciate any help.
I tried to reproduce the same in my environment, configured it the same way as mentioned above, and got the results below.
Please follow the code below.
Read the Spark DataFrame df:
var df = spark.read.format("delta").load(apptable)
Create a temp table:
%scala
val temp_table_name = "demtb"
df.createOrReplaceTempView(temp_table_name)
Now, using the code below, I got the expected output:
%scala
val aa = spark.sql("""select max(marks) from demtb""")
display(aa)
Update:
As mentioned in the comment below, this is working fine for me:
df1.write.mode("overwrite").format("parquet").option("path","/FileStore/dd/").option("overwriteschema","true").saveAsTable("app")
You can also try this syntax for configuring Azure Gen2. You can change the file format as required; for the demo I'm using CSV.
spark.conf.set("fs.azure.account.key.<storage_account_name>.dfs.core.windows.net","Access_key")
Scala
%scala
val df1 = spark.read.format("csv").option("header", "true").load("abfss://pool#vamblob.dfs.core.windows.net/")
display(df1)
Python
df1 = spark.read.format("csv").option("header", "true").load("abfss://pool#vamblob.dfs.core.windows.net/")
display(df1)
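Applying the same temp-view approach to the Delta path from the question would look roughly like the sketch below (it reuses the question's variable names; "appt" is an arbitrary view name, since spark.sql cannot resolve the Scala variable apptable as a table):
%scala
val apptable = "abfss://container@" + StorageAccountName + ".dfs.core.windows.net/path/to/data"
val df = spark.read.format("delta").load(apptable)
df.createOrReplaceTempView("appt")
val last_appt = spark.sql("""select max(updateddate) from appt""").collect()(0).getTimestamp(0)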

Spark Dataset Loading multiple CSV files with headers inside a folder and report mismatch in case the headers in all the files are not same

I am trying to load multiple CSV files from an HDFS directory into a Spark Dataset using the Spark 2.1.0 APIs:
val csvData = spark.read.option("header", "true").csv("csvdatatest/")
Inside the "csvdatatest" folder there are multiple CSV files. Spark picks the header only from the first file and uses it as the schema of the Dataset, ignoring the headers of the remaining CSV files. e.g.
hadoop fs -ls /user/kumara91/csvdatatest
Found 2 items
/user/kumara91/csvdatatest/controlfile-2017-10-19.csv
/user/kumara91/csvdatatest/controlfile-2017-10-23.csv
hadoop fs -cat /user/kumara91/csvdatatest/controlfile-2017-10-19.csv
Delivery ID,BroadLog ID,Channel,Address,Event type,Event date,File name
hadoop fs -cat /user/kumara91/csvdatatest/controlfile-2017-10-23.csv
Delivery ID,BroadLog ID,Channel,Address,Event type,Event date,File name,dummycolumn
scala> val csvData = spark.read.option("header", "true").csv("csvdatatest/")
csvData: org.apache.spark.sql.DataFrame = [Delivery ID: string, BroadLog ID: string ... 5 more fields]
scala> csvData.schema.fieldNames
res1: Array[String] = Array(Delivery ID, BroadLog ID, Channel, Address, Event type, Event date, File name)
Here, it loaded the header only from the file "controlfile-2017-10-19.csv" and ignored the header with the extra column "dummycolumn" in the other CSV file.
But my requirement is to compare the headers of all the CSV files in the folder, and load the files only if all of them contain the same header, reporting a mismatch if any CSV file contains more, fewer, or different header columns.
I have the option to do this using the regular HDFS filesystem APIs first and then use the Spark APIs, or to read all the CSV files one by one using the Spark APIs and do the comparison.
But I wanted to know if there is any way I can achieve this using the Spark APIs without iterating over each of the files, and also why Spark reads the header from one file and ignores the rest.
There is no way to read the data correctly without iterating over the files in some way. In big data, file-based data sources are directory-based, and the assumption for CSV is that all files in a directory have the same schema. There is no equivalent to the .read.option("mergeSchema", true) that exists for Parquet sources.
If you want to check just the headers, you will need to process the files one at a time. After you get a listing of all the files, using whichever method you want, the easiest thing to do is grab the headers using something like:
val paths: Seq[String] = ...
val pathsAndHeaders: Seq[(String, String)] = paths.map { path =>
  val header = sc.textFile(path).take(1).head
  (path, header)
}
A more efficient version, if you have many CSVs, is to distribute the paths across the cluster, but then you'll have to read the files yourself:
val paths: Seq[String] = ...
val pathsAndHeaders: Seq[(String, String)] = sc.parallelize(paths)
  .map { path =>
    val header = // read first line of file at path
    (path, header)
  }
  .collect
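One way to fill in that elided line, assuming the paths are fully qualified HDFS paths and you are comfortable opening them through the Hadoop FileSystem API on the executors, is a small helper along these lines (readHeader is a hypothetical name, not part of the original answer):
import java.io.{BufferedReader, InputStreamReader}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Read only the first line of the file at the given path via the Hadoop FileSystem API.
def readHeader(path: String): String = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val in = new BufferedReader(new InputStreamReader(fs.open(new Path(path))))
  try in.readLine() finally in.close()
}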
Now that you have the paths and the headers, do whatever you need. For example, once you group the files by header, you can pass a sequence of paths to load() to read them in one operation.
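A rough sketch of that last step, building on the pathsAndHeaders value from above (untested; the mismatch handling shown is just one option):
// Group paths by their header text.
val byHeader: Map[String, Seq[String]] = pathsAndHeaders
  .groupBy { case (_, header) => header }
  .mapValues(_.map { case (path, _) => path })
  .toMap

if (byHeader.size > 1) {
  // Report the mismatch, listing which files carry which header.
  byHeader.foreach { case (header, ps) =>
    println(s"header [$header] -> files: ${ps.mkString(", ")}")
  }
} else {
  // All headers match, so load everything in one read.
  val csvData = spark.read.option("header", "true").csv(byHeader.values.head: _*)
}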
It automatically merges and shows the latest schema.
For data with missing columns, nulls are shown.
I am using Spark version 2.3.1.

Transferring my dynamic value from response data to excel file in Jmeter

I want to transfer my dynamic value from the response data to an Excel file in JMeter... can anyone please let me know the clear process for it?
I used the Beanshell PostProcessor but didn't get the expected output...
Take a look at Apache POI - Java API To Access Microsoft Excel Format Files; this way you will be able to create, read, and update Excel files from Beanshell code. The easiest way to add binary document format support to JMeter is using Apache Tika: given you have tika-app.jar in the JMeter classpath, you will be able to view Excel file contents using the View Results Tree listener and use the Apache POI API to manipulate Excel files.
Minimal working code for creating an Excel file and adding a JMeter variable value to it looks like:
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
Workbook wb = new XSSFWorkbook();
Sheet sheet = wb.createSheet("Sheet1");
Row row = sheet.createRow(0);
Cell cell = row.createCell(0);
cell.setCellValue(vars.get("your_variable"));
FileOutputStream fileOut = new FileOutputStream("FileCreatedByJMeter.xlsx");
wb.write(fileOut);
fileOut.close();
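If the workbook already exists and you only want to append the variable value as a new row, a variant along these lines should also work (the file name and sheet position here are assumptions, not something from the question):
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

File excelFile = new File("FileCreatedByJMeter.xlsx");
FileInputStream fileIn = new FileInputStream(excelFile);
// Open the existing workbook instead of creating a new one.
Workbook wb = WorkbookFactory.create(fileIn);
Sheet sheet = wb.getSheetAt(0);
// Append the JMeter variable value as a new row after the last used row.
Row row = sheet.createRow(sheet.getLastRowNum() + 1);
row.createCell(0).setCellValue(vars.get("your_variable"));
fileIn.close();
FileOutputStream fileOut = new FileOutputStream(excelFile);
wb.write(fileOut);
fileOut.close();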
References:
Busy Developers' Guide to HSSF and XSSF Features
How to Extract Data From Files With JMeter

Manually populate an ImageField

I have a models.ImageField which I sometimes populate with the corresponding forms.ImageField. Sometimes, instead of using a form, I want to update the image field with an ajax POST. I am passing both the image filename and the image content (base64 encoded), so that in my API view I have everything I need. But I do not really know how to do this manually, since I have always relied on form processing, which automatically populates the models.ImageField.
How can I manually populate the models.ImageField having the filename and the file contents?
EDIT
I have reached the following status:
instance.image.save(file_name, File(StringIO(data)))
instance.save()
And this is updating the file reference, using the right value configured in upload_to in the ImageField.
But it is not saving the image. I would have imagined that the first .save call would:
Generate a file name in the configured storage
Save the file contents to the selected file, including handling of any kind of storage configured for this ImageField (local FS, Amazon S3, or whatever)
Update the reference to the file in the ImageField
And the second .save would actually save the updated instance to the database.
What am I doing wrong? How can I make sure that the new image content is actually written to disk, in the automatically generated file name?
EDIT2
I have a very unsatisfactory workaround, which works but is very limited. This illustrates the problems that using the ImageField directly would solve:
# TODO: workaround because I do not yet know how to correctly populate the ImageField
# This is very limited because:
# - it only uses the local filesystem (no AWS S3, ...)
# - it does not provide the path handling provided by upload_to
local_file = os.path.join(settings.MEDIA_ROOT, file_name)
with open(local_file, 'wb') as f:
    f.write(data)
instance.image = file_name
instance.save()
EDIT3
So, after some more playing around, I have discovered that my first implementation was doing the right thing, but silently failing if the passed data has the wrong format (I was mistakenly passing the base64 string instead of the decoded data). I'll post this as the solution.
Just save the file and the instance:
from StringIO import StringIO  # Python 2; on Python 3 use io.BytesIO for binary data
from django.core.files import File

instance.image.save(file_name, File(StringIO(data)))
instance.save()
No idea where the docs for this usecase are.
You can use InMemoryUploadedFile directly to save data:
import base64
import sys
import cStringIO
from django.core.files.uploadedfile import InMemoryUploadedFile

file = cStringIO.StringIO(base64.b64decode(request.POST['file']))
image = InMemoryUploadedFile(file,
                             field_name='file',
                             name=request.POST['name'],
                             content_type="image/jpeg",
                             size=sys.getsizeof(file),
                             charset=None)
instance.image = image
instance.save()
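For what it's worth, on current Django / Python 3 a more direct route is ContentFile, which avoids building an InMemoryUploadedFile by hand. A minimal sketch, assuming data_b64 holds the base64 string and file_name the target name (both names are illustrative):
import base64
from django.core.files.base import ContentFile

decoded = base64.b64decode(data_b64)
# .save() applies the upload_to logic, writes through the configured storage
# backend (local FS, S3, ...), and saves the model because save=True.
instance.image.save(file_name, ContentFile(decoded), save=True)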

Reading in gzipped data from S3 in Ruby

My company has data messages (JSON) stored in gzipped files on Amazon S3. I want to use Ruby to iterate through the files and do some analytics. I started to use the 'aws/s3' gem, and can get each file as an object:
#<AWS::S3::S3Object:0x4xxx4760 '/my.company.archive/data/msg/20131030093336.json.gz'>
But once I have this object, I do not know how to unzip it or even access the data inside of it.
You can see the documentation for S3Object here: http://amazon.rubyforge.org/doc/classes/AWS/S3/S3Object.html.
You can fetch the content by calling your_object.value; see if you can get that far. Then it should be a question of unpacking the gzip blob. Zlib should be able to handle that.
I'm not sure whether .value returns a big string of binary data or an IO object. If it's a string, you can wrap it in a StringIO object to pass it to Zlib::GzipReader.new, e.g.
json_data = Zlib::GzipReader.new(StringIO.new(your_object.value)).read
S3Object has a stream method, which I would hope behaves like an IO object (I can't test that here, sorry). If so, you could do this:
json_data = Zlib::GzipReader.new(your_object.stream).read
Once you have the unzipped json content, you can just call JSON.parse on it, e.g.
JSON.parse Zlib::GzipReader.new(StringIO.new(your_object.value)).read
For me, the below set of steps worked:
Read the csv.gz from the S3 client and write it to a local file.
Open the local csv.gz file using GzipReader and read the CSV from it.
file_path = "/tmp/gz/x.csv.gz"
File.open(file_path, mode="wb") do |f|
s3_client.get_object(bucket: bucket, key: key) do |gzfiledata|
f.write gzfiledata
end
end
data = []
Zlib::GzipReader.open(file_path) do |gz_reader|
csv_reader = ::FastestCSV.new(gz_reader)
csv_reader.each do |csv|
data << csv
end
end
The S3Object documentation is updated and the stream method is no longer available: https://docs.aws.amazon.com/AWSRubySDK/latest/AWS/S3/S3Object.html
So, the best way to read data from an S3 object would be this:
json_data = Zlib::GzipReader.new(StringIO.new(your_object.read)).read
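With the current aws-sdk-s3 gem (v3) the same flow looks roughly like the sketch below. The bucket and key come from the question, the region is an assumption, and it assumes each line of the decompressed file is one JSON message:
require 'aws-sdk-s3'
require 'zlib'
require 'json'
require 'stringio'

s3 = Aws::S3::Client.new(region: 'us-east-1') # region is an assumption
resp = s3.get_object(bucket: 'my.company.archive', key: 'data/msg/20131030093336.json.gz')
Zlib::GzipReader.new(StringIO.new(resp.body.read)).each_line do |line|
  message = JSON.parse(line)
  # ... run your analytics on message here ...
end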
