HTTP request with parquet and pyarrow

I would like to use pyarrow to read/query parquet data from a REST server. At the moment I'm chunking the data, converting to pandas, dumping to JSON, and streaming the chunks. Like this:
p = pq.ParquetDataset('/path/to/data.parquet', filters=filter, use_legacy_dataset=False)
batches = p._dataset.to_batches(filter=p._filter_expression)
(json.dumps(b.to_pandas().values.tolist()) for b in batches)
This is effectively the same work as
ds = pq.ParquetDataset('/path/to/data.parquet',
                       use_legacy_dataset=False,
                       filters=filters)
df = ds.read().to_pandas()
data = pd.DataFrame(orjson.loads(orjson.dumps(df.values.tolist())))
without the network I/O. It's about 50x slower than just reading to pandas directly:
df = ds.read().to_pandas()
Is there a way to serialize the parquet dataset to some binary string that I can send over HTTP and parse on the client side?

You can send your data using the Arrow in-memory columnar format. It will be much more efficient and compact than JSON, but bear in mind it will be binary data (which, unlike JSON, is not human readable).
See the doc for a full example.
In your case you want to do something like this:
import pyarrow as pa
import pyarrow.parquet as pq

ds = pq.ParquetDataset('/path/to/data.parquet',
                       use_legacy_dataset=False,
                       filters=filters)
table = ds.read()  # pa.Table

# Write the data:
batches = table.to_batches()
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
for batch in batches:
    writer.write(batch)
writer.close()
buf = sink.getvalue()

# Read the data:
reader = pa.ipc.open_stream(buf)
read_batches = [b for b in reader]
read_table = pa.Table.from_batches(read_batches)
read_table.to_pandas()
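For the HTTP side, here is a minimal sketch of how that buffer could be served and parsed on the client. It assumes Flask on the server and requests plus pyarrow on the client; the endpoint name, port, and media type are illustrative choices, not part of the original answer.
import pyarrow as pa
import pyarrow.parquet as pq
from flask import Flask, Response

app = Flask(__name__)

@app.route("/data")
def data():
    ds = pq.ParquetDataset('/path/to/data.parquet', use_legacy_dataset=False)
    table = ds.read()
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, table.schema)
    for batch in table.to_batches():
        writer.write(batch)
    writer.close()
    # Send the raw Arrow IPC stream as the response body
    return Response(sink.getvalue().to_pybytes(),
                    mimetype="application/vnd.apache.arrow.stream")

# On the client:
# import requests, pyarrow as pa
# resp = requests.get("http://localhost:5000/data")
# table = pa.ipc.open_stream(resp.content).read_all()
# df = table.to_pandas()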

Related

How to upload Query Result from Snowflake to S3 Directly?

I have a query interface where the user writes a SQL query and gets the result. The warehouse we use is Snowflake, which runs the query and whose result we display. We use the Snowflake JDBC driver to establish a connection, asynchronously queue the query, get a query ID (UUID) from Snowflake, and use that query ID to check the status and fetch the result.
Sample Code:
try {
    ResultSetMetaData resultSetMetaData = resultSet.getMetaData();
    int numColumns = resultSetMetaData.getColumnCount();
    for (int i = 1; i <= numColumns; i++) {
        arrayNode.add(objectMapper.createObjectNode()
                .put("name", resultSetMetaData.getColumnName(i))
                .put("attribute_number", i)
                .put("data_type", resultSetMetaData.getColumnTypeName(i))
                .put("type_modifier", (Short) null)
                .put("scale", resultSetMetaData.getScale(i))
                .put("precision", resultSetMetaData.getPrecision(i)));
    }
    rootNode.set("metadata", arrayNode);

    arrayNode = objectMapper.createArrayNode();
    while (resultSet.next()) {
        ObjectNode resultObjectNode = objectMapper.createObjectNode();
        for (int i = 1; i <= numColumns; i++) {
            String columnName = resultSetMetaData.getColumnName(i);
            resultObjectNode.put(columnName, resultSet.getString(i));
        }
        arrayNode.add(resultObjectNode);
    }
    rootNode.set("results", arrayNode);

    // TODO: Instead of returning the entire result string, send it in chunks to the S3 utility class for upload
    resultSet.close();
    jsonString = objectMapper.writeValueAsString(rootNode);
}
As you can see, our use case requires sending the metadata (column details) along with the result. The result set is then uploaded to S3 and users are given an S3 link to view the results.
I am trying to figure out whether this scenario can be handled in Snowflake itself, where Snowflake generates the metadata for the query and uploads the result set to a user-defined bucket, so that consumers of Snowflake won't have to do this themselves. I have read about Snowflake Streams and COPY from stages. Can someone help me understand if this is feasible and, if so, how it can be achieved?
Is there any way to upload the result of a query to S3 directly using its query ID from Snowflake, without fetching it first and then uploading it to S3?
You can store the results in an S3 bucket using the COPY command. This is a simplified example showing the process on a temporary internal stage. For your use case, you would create and use an external stage in S3:
create temp stage FOO;
select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION";
copy into @FOO from (select * from table(result_scan(last_query_id())));
The reason you want to use COPY from a previous select is that the COPY command is somewhat limited in what it can use for the query. By running the query as a regular select first and then running a select * from that result, you get past those limitations.
The COPY command supports other file formats. As written, this uses the default CSV format; you can also specify JSON, Parquet, or a custom delimited format using a named file format.
https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html
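If you want to drive this from your service using the query ID you already get back, here is a rough sketch with the snowflake-connector-python package (shown in Python for brevity; the same COPY statement can be executed over your existing JDBC connection). The connection parameters, stage name, and query ID are placeholders, and the external stage pointing at your S3 bucket must already exist (created with CREATE STAGE ... URL='s3://...' plus credentials or a storage integration).
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="my_schema",
)

query_id = "01a2b3c4-0000-0000-0000-000000000000"  # the UUID your service already receives

try:
    cur = conn.cursor()
    # RESULT_SCAN re-reads the cached result of the earlier query by its ID;
    # COPY INTO <stage> unloads it straight to the S3 location behind the stage,
    # so the rows never pass through this client.
    cur.execute(f"""
        COPY INTO @MY_S3_STAGE/results/{query_id}/
        FROM (SELECT * FROM TABLE(RESULT_SCAN('{query_id}')))
        FILE_FORMAT = (TYPE = PARQUET)
    """)
finally:
    conn.close()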

How to read the Arrow parquet key-value metadata?

When I save a parquet file in R or Python (using pyarrow), I get an Arrow schema string saved in the metadata.
How do I read this metadata? Is it Flatbuffers-encoded data? Where is the definition of the schema? It's not listed on the Arrow documentation site.
The metadata is a key-value pair that looks like this
key: "ARROW:schema"
value: "/////5AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAEAAAAyP///wQAAAABAAAAFAAAABAAGAAIAAYABwAMABAAFAAQAAAAAAABBUAAAAA4AAAAEAAAACgAAAAIAAgAAAAEAAgAAAAMAAAACAAMAAgABwA…
as a result of writing this in R
df = data.frame(a = factor(c(1, 2)))
arrow::write_parquet(df, "c:/scratch/abc.parquet")
The schema is base64-encoded flatbuffer data. You can read the schema in Python using the following code:
import base64
import pyarrow as pa
import pyarrow.parquet as pq
filename = "c:/scratch/abc.parquet"  # the file written from R above
meta = pq.read_metadata(filename)
decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"])
schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema))
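As a quick cross-check (not part of the original answer), you can compare the decoded schema against what pyarrow itself reports for the same file:
# Both should describe the same schema for the file written above.
print(schema)                    # schema decoded from the ARROW:schema metadata
print(pq.read_schema(filename))  # schema as pyarrow reads it directly from the file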

Spark: Measure performance of UDF on large dataset

I want to measure the performance of a UDF on a large dataset. The Spark SQL is:
spark.sql("SELECT my_udf(value) as results FROM my_table")
The UDF returns an array. The issue I'm facing is how to make this execute without returning the data to the driver. I need an action, but anything that returns the full dataset will crash the driver (e.g. collect), and anything else won't run the calculation for all rows (show / take(n)). So how can I trigger the calculation without returning all the data to the driver?
I think the closest you can get to running only your UDF for timing purposes is something like the code below. The general idea is to use caching to remove data loading time from the measurement, and then use a foreach that does nothing to force Spark to run your UDF.
val myFunc: String => Int = _.length
val myUdf = udf(myFunc)
val data = Seq("a", "aa", "aaa", "aaaa")
val df = sc.parallelize(data).toDF("text")

// Cache to remove data loading from measurements as much as possible.
// Also, do a no-op foreach action to force the data to load and cache before the test.
df.cache()
df.foreach(row => {})

// Run the test, grabbing before and after time
val start = System.nanoTime()
val udfDf = df.withColumn("udf_column", myUdf($"text"))

// Force Spark to run the UDF and do nothing with the result,
// so no writing time is included in the measurement
udfDf.rdd.foreach(row => {})

// Get the total elapsed time
val elapsedNs = System.nanoTime() - start
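If you are working from PySpark rather than Scala, the same cache-then-no-op-foreach pattern looks roughly like this (a sketch, not from the original answer; the example UDF and column names are made up):
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

my_udf = udf(lambda s: len(s), IntegerType())
df = spark.createDataFrame([("a",), ("aa",), ("aaa",), ("aaaa",)], ["text"])

# Cache and materialize first so data loading stays out of the measurement
df.cache()
df.foreach(lambda row: None)

start = time.perf_counter()
udf_df = df.withColumn("udf_column", my_udf(col("text")))
# No-op action: forces the UDF to run on every row without collecting results
udf_df.foreach(lambda row: None)
print("UDF pass took %.3f s" % (time.perf_counter() - start))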

How do I download all the abstract data from PubMed (NCBI)?

I want to download all the pubmed data abstracts.
Does anyone know how I can easily download all of the pubmed article abstracts?
I found the source of the data:
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/af/12/
Is there any way to download all these tar files?
Thanks in advance.
There is a package called rentrez (https://ropensci.org/packages/). Check it out. You can retrieve abstracts by specific keywords, PMIDs, etc. I hope it helps.
UPDATE: You can download the abstracts by passing your list of IDs to the following code.
library(rentrez)
library(XML)

your.ids <- c("26386083","26273372","26066373","25837167","25466451","25013473")

# rentrez function to get the data from the pubmed db
fetch.pubmed <- entrez_fetch(db = "pubmed", id = your.ids,
                             rettype = "xml", parsed = T)

# Extract the abstracts for the respective IDs
abstracts = xpathApply(fetch.pubmed, '//PubmedArticle//Article', function(x)
  xmlValue(xmlChildren(x)$Abstract))

# Name the abstracts with the IDs
names(abstracts) <- your.ids
abstracts
col.abstracts <- do.call(rbind.data.frame, abstracts)
dim(col.abstracts)
write.csv(col.abstracts, file = "test.csv")
I appreciate that this is a somewhat old question.
If you wish to get all the PubMed entries with Python, I wrote the following script a while ago:
import requests
import json

search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&mindate=1800/01/01&maxdate=2016/12/31&usehistory=y&retmode=json"
search_r = requests.post(search_url)
search_data = search_r.json()
webenv = search_data["esearchresult"]['webenv']
total_records = int(search_data["esearchresult"]['count'])

# retmax matches the loop step below so no records are skipped between batches
fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmax=10000&query_key=1&webenv=" + webenv

for i in range(0, total_records, 10000):
    this_fetch = fetch_url + "&retstart=" + str(i)
    print("Getting this URL: " + this_fetch)
    fetch_r = requests.post(this_fetch)
    f = open('pubmed_batch_' + str(i) + '_to_' + str(i + 9999) + ".json", 'w')
    f.write(fetch_r.text)
    f.close()

print("Number of records found: " + str(total_records))
It starts off by making an entrez/eutils search request between two dates that is guaranteed to capture all of PubMed. Then, from that response, the 'webenv' (which saves the search history) and total_records values are retrieved. Using the webenv capability saves having to hand the individual record IDs to the efetch call.
Fetching records (efetch) can only be done in batches of 10,000; the for loop grabs each batch and saves it in a labelled file until all the records are retrieved.
Note that requests can fail (non-200 HTTP responses, errors), so in a more robust solution you should wrap each requests.post() in a try/except and make sure the response has a 200 status before dumping the data to file.
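A minimal sketch of such a retry/status-check wrapper (the retry count and sleep interval are arbitrary choices, not part of the original answer):
import time
import requests

def fetch_with_retry(url, retries=3, wait_seconds=5):
    """POST to an E-utilities URL, retrying on errors and non-200 responses."""
    for attempt in range(1, retries + 1):
        try:
            r = requests.post(url, timeout=60)
            if r.status_code == 200:
                return r.text
            print("Attempt %d got HTTP %d" % (attempt, r.status_code))
        except requests.RequestException as exc:
            print("Attempt %d failed: %s" % (attempt, exc))
        time.sleep(wait_seconds)
    raise RuntimeError("Giving up on " + url)

# Usage inside the loop above:
# f.write(fetch_with_retry(this_fetch))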

Pig loader for Splunk-like records

I am trying to use PIG to read data from HDFS where the files contain rows that look like:
"key1"="value1", "key2"="value2", "key3"="value3"
"key1"="value10", "key3"="value30"
In a way the rows of the data are essentially dictionaries:
{"key1":"value1", "key2":"value2", "key3":"value3"}
{"key1":"value10", "key3":"value30"}
I can read and dump a portion of this data easily enough with something like:
data = LOAD '/hdfslocation/weirdformat*' USING PigStorage(',');
sampled = SAMPLE data 0.00001;
dump sampled;
My problem is that I can't parse it efficiently. I have tried to use
org.apache.pig.piggybank.storage.MyRegExLoader
but it seems extremely slow.
Could someone recommend a different approach?
It seems one way is to use a Python UDF.
This solution is heavily inspired by bag-to-tuple.
In myudfs.py write:
#!/usr/bin/python

def FieldPairsGenerator(dataline):
    for x in dataline.split(','):
        k, v = x.split('=')
        yield (k.strip().strip('"'), v.strip().strip('"'))

@outputSchema("foo:map[]")
def KVDataToDict(dataline):
    return dict(kvp for kvp in FieldPairsGenerator(dataline))
Then write the following Pig script:
REGISTER 'myudfs.py' USING jython AS myfuncs;
data = LOAD 'whereyourdatais*.gz' AS (foo:chararray);
A = FOREACH data GENERATE myfuncs.KVDataToDict(foo);
A now holds the data as a Pig map.
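To sanity-check the UDF's parsing logic outside Pig, the same transformation can be exercised in plain Python on one of the sample rows from the question (a quick sketch, not part of the original answer):
# Quick local check of the key/value parsing, using a sample row from the question.
line = '"key1"="value1", "key2"="value2", "key3"="value3"'
parsed = dict(
    (k.strip().strip('"'), v.strip().strip('"'))
    for k, v in (pair.split('=') for pair in line.split(','))
)
print(parsed)  # {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}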
