PyArrow read_table filter null values - parquet

I'm pretty new to PyArrow and I'm trying to read a Parquet file while filtering the data I'm loading.
I have an end_time column, and when I filter on a date it works just fine: I get only the rows that match my date.
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

last_updated_at = datetime(2021, 3, 5, 21, 0, 23)
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet',
                                      filters=[('end_time', '>', last_updated_at)])
print(table_ad_sets_ongoing.num_rows)
But I sometimes also have a null value in this end_time field.
So I tried filtering this way:
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

table_ad_sets_ongoing = pq.read_table('export/facebook.parquet',
                                      filters=[('end_time', '=', None)])
print(table_ad_sets_ongoing.num_rows)
But the result is always 0, even though some rows actually contain a null value.
After some digging I suspect this has to do with null_selection_behavior, which defaults to 'drop' and therefore skips null values: https://arrow.apache.org/docs/python/generated/pyarrow.compute.filter.html
I guess I should set this parameter to 'emit_null', but I can't find a way to do it.
Any idea?
Thank you

I finally found the answer to my question.
It comes from the Arrow GitHub issues (silly of me not to have looked there earlier): https://github.com/apache/arrow/issues/9160
To filter on a null field, use a dataset expression instead of the tuple syntax:
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# ~ds.field("end_time").is_valid() selects the rows where end_time is null
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet',
                                      filters=~ds.field("end_time").is_valid())
print(table_ad_sets_ongoing.num_rows)
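If you need the date condition and the null rows at the same time, dataset expressions can be combined with the & and | operators. A minimal sketch, assuming the same file and column as above:
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from datetime import datetime

last_updated_at = datetime(2021, 3, 5, 21, 0, 23)
# Keep rows that are newer than the cutoff OR have no end_time at all.
expr = (ds.field("end_time") > last_updated_at) | ~ds.field("end_time").is_valid()
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters=expr)
print(table_ad_sets_ongoing.num_rows)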

Related

AWS Lambda Python boto3 reading from DynamoDB table with multiple attributes in KeyConditionExpression

basicSongsTable has artist as the partition key and song as the sort key.
I am able to read using Query for one artist, but I want to read two artists with the following code. It gives a vague error: "errorMessage": "Syntax error in module 'lambda_function': positional argument follows keyword argument (lambda_function.py, line 17)".
import boto3
from pprint import pprint

dynamodbclient = boto3.client('dynamodb')

def lambda_handler(event, context):
    response = dynamodbclient.query(
        TableName='basicSongsTable',
        # The line below passes a positional argument after keyword
        # arguments, which is what raises the SyntaxError:
        KeyConditionExpression='artist = :varartistname1', 'artist =:varartistname2',
        ExpressionAttributeValues={
            ':varartistname1': {'S': 'basam'},
            ':varartistname2': {'S': 'sree'}
        }
    )
    pprint(response['Items'])
If I give only one KeyConditionExpression it works:
        KeyConditionExpression='artist = :varartistname1',
        ExpressionAttributeValues={
            ':varartistname1': {'S': 'basam'}
        }
As per the documentation:
KeyConditionExpression (string) --
The condition that specifies the key values for items to be retrieved by the Query action.
The condition must perform an equality test on a single partition key value.
You are trying to perform an equality test on multiple partition key values, which doesn't work.
To get data for both artists, you will have to either run two queries or do a scan (which I do not recommend).
For other options, I would recommend you take a look at this answer and its pros and cons.
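A minimal sketch of the two-query approach, reusing the table and key names from the question (the loop structure here is illustrative, not from the original answer):
import boto3
from pprint import pprint

dynamodbclient = boto3.client('dynamodb')

def lambda_handler(event, context):
    items = []
    # One Query per partition key value; each query is an equality
    # test on a single artist, which is what DynamoDB requires.
    for artist in ('basam', 'sree'):
        response = dynamodbclient.query(
            TableName='basicSongsTable',
            KeyConditionExpression='artist = :varartistname',
            ExpressionAttributeValues={':varartistname': {'S': artist}}
        )
        items.extend(response['Items'])
    pprint(items)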

How to have sorting functionality with Plotly tables?

I have created a data table using Plotly from a pandas dataframe. Is there a way to add sorting functionality when clicking on the header of the table? My sample code is below:
import plotly.graph_objects as go
import pandas as pd

df = pd.DataFrame(data={'City': ['A', 'C', 'B'], 'Population': [552, 658, 423]})
fig = go.Figure(data=[go.Table(
    header=dict(values=list(df.columns), align='center',
                font=dict(family='Segoe UI', color='black', size=12)),
    cells=dict(values=df.transpose().values.tolist(), align='center',
               font=dict(family='Segoe UI', color='black', size=12))
)])
fig.show()
Any help/lead on this will be appreciated! Thanks.
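One hedged pointer, since this question has no answer in the thread: a go.Table trace renders a static figure with no built-in click-to-sort. Dash's DataTable component does support header-click sorting via its sort_action parameter (this sketch assumes the separate dash package is installed):
from dash import Dash, dash_table
import pandas as pd

df = pd.DataFrame(data={'City': ['A', 'C', 'B'], 'Population': [552, 658, 423]})

app = Dash(__name__)
app.layout = dash_table.DataTable(
    data=df.to_dict('records'),
    columns=[{'name': c, 'id': c} for c in df.columns],
    sort_action='native',  # click a column header to sort client-side
)

if __name__ == '__main__':
    app.run(debug=True)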

How to filter a map in DynamoDB on the AWS console?

I have a simple table like the one below in DynamoDB.
What I need:
I am trying to filter on the tools_type attribute, which is of MAP type. I want to filter on the antivirus key of this MAP column, but the filter option only offers the types string, number, and boolean. How can I filter on antivirus and its value in the example below?
Note: I need to do the filtering on the AWS DynamoDB console.
Filtering on a MAP or LIST in the web console is not possible. Please use an SDK or the REST API instead.
Here is an example of applying a filter on a MAP attribute using the Python SDK:
>>> import boto3
>>> from boto3.dynamodb.conditions import Key, Attr
>>> dynamodb = boto3.resource('dynamodb')
>>> table = dynamodb.Table('example-ddb')
>>> data = table.scan(
... FilterExpression=Attr('tools_type.antivirus').eq('yes')
... )
>>> data['Items']
[{'pk': '2', 'tools_type': {'antivirus': 'yes'}}]
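Note that Scan with a FilterExpression still reads the whole table; the filter is applied after items are read, so you are billed for the full scan rather than just the matching items.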

How to read the arrow parquet key value metadata?

When I save a Parquet file in R or Python (using pyarrow), I get an Arrow schema string saved in the metadata.
How do I read this metadata? Is it Flatbuffers-encoded data? Where is the definition for the schema? It's not listed on the Arrow documentation site.
The metadata is a key-value pair that looks like this
key: "ARROW:schema"
value: "/////5AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAEAAAAyP///wQAAAABAAAAFAAAABAAGAAIAAYABwAMABAAFAAQAAAAAAABBUAAAAA4AAAAEAAAACgAAAAIAAgAAAAEAAgAAAAMAAAACAAMAAgABwA…
as a result of writing this in R
df = data.frame(a = factor(c(1, 2)))
arrow::write_parquet(df, "c:/scratch/abc.parquet")
The schema is base64-encoded Flatbuffers data. You can read it back in Python like this:
import base64
import pyarrow as pa
import pyarrow.parquet as pq

# Path from the R example above
filename = "c:/scratch/abc.parquet"

meta = pq.read_metadata(filename)
decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"])
schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema))
print(schema)
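For the R example above, printing the schema should show the field a with a dictionary-encoded type, which is how R factors are represented in Arrow.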

HBase scan returning data out of range

I was doing a scan using a start row key and a stop row key in the HBase shell, but the output I am receiving is outside the range passed. Please refer to the HBase query:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes

scan 'TableName', { LIMIT => 2,
  STARTROW => '000|9223370554721275807',
  STOPROW => '101|9223370554727575807',
  FILTER => SingleColumnValueFilter.new(
    Bytes.toBytes('col_family'),
    Bytes.toBytes('col_qualifier'),
    CompareFilter::CompareOp.valueOf('EQUAL'),
    Bytes.toBytes('Some Value')),
  COLUMNS => 'col_family:col_qualifier',
  REVERSED => false }
But the output received is outside this range:
016|9223370554960173487
021|9223370555154148992
Please let me know if my scan is correct, or what the root cause for this could be. Any help will be really appreciated.
Thanks
HBase compares row keys lexicographically as byte strings, not numerically. If you put the four row keys mentioned in your question in a file and sort them, the result will be:
000|9223370554721275807
016|9223370554960173487
021|9223370555154148992
101|9223370554727575807
Thus the values you received are not outside the range of your scan; both returned keys sort between your STARTROW and STOPROW.
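A quick way to check this (a Python sketch, not part of the original answer) is to sort the four keys the same way HBase does:
keys = [
    '000|9223370554721275807',
    '101|9223370554727575807',
    '016|9223370554960173487',
    '021|9223370555154148992',
]
# Python's default string sort is lexicographic, matching HBase's
# byte-wise row key ordering for ASCII keys.
for k in sorted(keys):
    print(k)
# 016|... and 021|... both print between the STARTROW and STOPROW keys.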
