I'm pretty new to pyarrow and I'm trying to read a Parquet file while filtering the data I load.
I have an end_time column, and filtering on a date works just fine: I get only the rows that match my date.
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime
last_updated_at = datetime(2021,3,5,21,0,23)
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters = [('end_time', '>', last_updated_at)])
print(table_ad_sets_ongoing.num_rows)
But this end_time field sometimes contains a null value.
So I tried filtering this way:
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters = [('end_time', '=', None)])
print(table_ad_sets_ongoing.num_rows)
But the result is always 0, even though I do have some rows with a null value.
After some digging, I suspect this has to do with the null_selection_behavior option of pyarrow.compute.filter, which defaults to 'drop' and therefore skips null values: https://arrow.apache.org/docs/python/generated/pyarrow.compute.filter.html#pyarrow.compute.filter
I guess I should set this parameter to 'emit_null', but I can't find a way to do it.
Any idea?
Thank you
I finally found the answer to my question.
It comes from the Arrow GitHub issue tracker (silly of me not to have looked there earlier): https://github.com/apache/arrow/issues/9160
To filter on a null field, we have to write it this way:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
from datetime import datetime
# ~is_valid() keeps only the rows where end_time is null
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters=~ds.field("end_time").is_valid())
print(table_ad_sets_ongoing.num_rows)
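Building on that, the null check can also be combined with the original date filter in a single expression. A small sketch, assuming a pyarrow version that accepts dataset expressions in filters (the same assumption as the snippet above; newer releases also expose an equivalent is_null() method on expressions):

import pyarrow.dataset as ds
import pyarrow.parquet as pq
from datetime import datetime

last_updated_at = datetime(2021, 3, 5, 21, 0, 23)

# keep rows whose end_time is after the cutoff, plus rows where end_time is null
expr = (ds.field("end_time") > last_updated_at) | ~ds.field("end_time").is_valid()
table_ad_sets_ongoing = pq.read_table('export/facebook.parquet', filters=expr)
print(table_ad_sets_ongoing.num_rows)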
I have a simple table like the one below on DynamoDB.
What I need:
I am trying to filter on the tools_type attribute, which is of MAP type. I want to filter on the antivirus key of this MAP column, but the filter option only offers the types string, number, and boolean. How can I filter on antivirus and its value in the example below?
Note: I need to apply the filter in the AWS DynamoDB console.
What I tried:
Filtering on MAP or LIST attributes in the web console is not possible. Please use an SDK or the REST API instead.
Here is an example of applying a filter on a MAP attribute using the Python SDK:
>>> import boto3
>>> from boto3.dynamodb.conditions import Key, Attr
>>> dynamodb = boto3.resource('dynamodb')
>>> table = dynamodb.Table('example-ddb')
>>> data = table.scan(
... FilterExpression=Attr('tools_type.antivirus').eq('yes')
... )
>>> data['Items']
[{'pk': '2', 'tools_type': {'antivirus': 'yes'}}]
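One caveat worth noting: scan reads at most 1 MB of data per call, so on larger tables you have to page through the results yourself. A rough sketch of the same filter with pagination (the table name 'example-ddb' is carried over from the example above):

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('example-ddb')

items = []
scan_kwargs = {'FilterExpression': Attr('tools_type.antivirus').eq('yes')}
while True:
    response = table.scan(**scan_kwargs)
    items.extend(response['Items'])
    if 'LastEvaluatedKey' not in response:
        break
    # continue the scan from where the previous page stopped
    scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

print(items)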
When I save a Parquet file from R or Python (using pyarrow), an Arrow schema string is saved in the metadata.
How do I read this metadata? Is it Flatbuffers-encoded data? Where is the definition of this schema? It's not listed on the Arrow documentation site.
The metadata is a key-value pair that looks like this
key: "ARROW:schema"
value: "/////5AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAEAAAAyP///wQAAAABAAAAFAAAABAAGAAIAAYABwAMABAAFAAQAAAAAAABBUAAAAA4AAAAEAAAACgAAAAIAAgAAAAEAAgAAAAMAAAACAAMAAgABwA…
as a result of writing this in R
df = data.frame(a = factor(c(1, 2)))
arrow::write_parquet(df, "c:/scratch/abc.parquet")
The schema is base64-encoded flatbuffer data. You can read the schema in Python using the following code:
import base64
import pyarrow as pa
import pyarrow.parquet as pq

filename = "c:/scratch/abc.parquet"  # the file written by the R snippet above

# read the Parquet footer metadata and pull out the ARROW:schema key-value entry
meta = pq.read_metadata(filename)
decoded_schema = base64.b64decode(meta.metadata[b"ARROW:schema"])

# the decoded bytes are an Arrow IPC schema message
schema = pa.ipc.read_schema(pa.BufferReader(decoded_schema))
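From there the schema object can be printed or walked field by field. A small usage sketch; for the R example above, column a was written as a factor, so it would be expected to show up as a dictionary-encoded field:

print(schema)
for field in schema:
    print(field.name, field.type)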
I'm using django rest gis to load up Leaflet maps, and at the top level of my app I'm looking at a map of the world. The basemap is from Mapbox. I make a call to my REST API and return an outline of all of the individual countries included in the app. Currently, the GeoJSON file that is returned is 1.1MB in size, and I have more countries to add, so I'd like to reduce the size to improve performance.
Here is an example of the contents:
{"type":"FeatureCollection","features":[{"type":"Feature","geometry":{"type":"MultiPolygon","coordinates":[[[[-64.54916992187498,-54.71621093749998],[-64.43881835937495,-54.739355468749984],[-64.22050781249999,-54.721972656249996],[-64.10532226562495,-54.72167968750003],[-64.054931640625,-54.72988281250001],[-64.03242187499995,-54.74238281249998],[-63.881933593750006,-54.72294921875002],[-63.81542968749997,-54.725097656250014],[-63.83256835937499,-54.76796874999995],[-63.97124023437499,-54.810644531250034],[-64.0283203125,-54.79257812499999],[-64.32290039062497,-54.79648437499999],[-64.45327148437497,-54.84033203124995],[-64.50869140625,-54.83994140624996],[-64.637353515625,-54.90253906250001],
The size of the file is a function of the number of points and the precision of those points. I was thinking that the most expedient way to reduce the size, while preserving my original data, would be to reduce the precision of the geometry points. But I'm at a bit of a loss as to how to do this. I've looked through the documentation on GitHub and haven't found any clues.
Is there a field option to reduce the precision of the returned GeoJSON? Or is there another way to achieve what I'm trying to do?
Many thanks.
I ended up simplifying the geometry using PostGIS and then passing that queryset to the serializer. I started by creating a raw query in the model manager.
class RegionQueryset(models.query.QuerySet):
    def simplified(self):
        return self.raw(
            "SELECT region_code, country_code, name, slug, ST_SimplifyVW(geom, 0.01) as geom FROM regions_region "
            "WHERE active=TRUE AND region_type = 'Country'"
        )


class RegionsManager(models.GeoManager):
    def get_queryset(self):
        return RegionQueryset(self.model, using=self._db)

    def simplified(self):
        return self.get_queryset().simplified()
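For Region.objects.simplified() to resolve, the Region model also has to be wired up to the custom manager. A minimal sketch, with field names guessed from the raw query above rather than taken from the original project:

from django.contrib.gis.db import models


class Region(models.Model):
    # field names assumed from the raw SQL above
    name = models.CharField(max_length=255)
    slug = models.SlugField()
    region_code = models.CharField(max_length=10)
    country_code = models.CharField(max_length=10)
    region_type = models.CharField(max_length=50)
    active = models.BooleanField(default=True)
    geom = models.MultiPolygonField()

    objects = RegionsManager()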
The view is quite simple:
class CountryApiGeoListView(ListAPIView):
    queryset = Region.objects.simplified()
    serializer_class = CountryGeoSerializer
And the serializer:
class CountryGeoSerializer(GeoFeatureModelSerializer):
    class Meta:
        model = Region
        geo_field = 'geom'
        queryset = Region.objects.filter(active=True)
        fields = ('name', 'slug', 'region_code', 'geom')
I ended up settling on the PostGIS function ST_SimplifyVW() after running some tests.
My dataset has 20 countries with geometry provided by Natural Earth. Without optimizing, the GeoJSON file was 1.2MB in size, the query took 17ms to run and 1.15 seconds to load in my browser. Of course, the quality of the rendered outline was great. I then tried the ST_Simplify() and ST_SimplifyVW() functions with different parameters. From these very rough tests, I decided on ST_SimplifyVW(geom, 0.01).
| Function | Size | Query time | Load time | Appearance |
|---|---|---|---|---|
| None | 1.2MB | 17ms | 1.15s | Great |
| ST_Simplify(geom, 0.1) | 240K | 15.94ms | 371ms | Barely Acceptable |
| ST_Simplify(geom, 0.01) | 935k | 22.45ms | 840ms | Good |
| ST_SimplifyVW(geom, 0.01) | 409K | 25.92ms | 628ms | Good |
My setup was Postgres 9.4 and PostGIS 2.2. ST_SimplifyVW is not included in PostGIS 2.1, so you must use 2.2.
You could save some space by setting the precision on the GeometryField during serialization. This is an extract of my code, modelling the same WorldBorder model defined in the GeoDjango tutorial. For serializers.py:
from rest_framework_gis.serializers import (
    GeoFeatureModelSerializer, GeometryField)
from .models import WorldBorder


class WorldBorderSerializer(GeoFeatureModelSerializer):
    # set a custom precision for the geometry field
    mpoly = GeometryField(precision=2, remove_duplicates=True)

    class Meta:
        model = WorldBorder
        geo_field = "mpoly"
        fields = (
            "id", "name", "area", "pop2005", "fips", "iso2", "iso3",
            "un", "region", "subregion", "lon", "lat",
        )
Explicitly defining the precision with mpoly = GeometryField(precision=2) will do the trick. remove_duplicates=True removes identical points generated by truncating the coordinates. You need to keep the geo_field reference to your geometry field in the Meta class, or the REST framework will not work. This is my views.py code to expose the GeoJSON object using a ViewSet:
from rest_framework import viewsets, permissions
from .models import WorldBorder
from .serializers import WorldBorderSerializer


class WorldBorderViewSet(viewsets.ModelViewSet):
    queryset = WorldBorder.objects.all()
    serializer_class = WorldBorderSerializer
    permission_classes = (permissions.IsAuthenticatedOrReadOnly, )
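To actually browse the GeoJSON output, the ViewSet still has to be routed. A minimal urls.py sketch (the 'worldborders' prefix and the 'api/' path are assumed names, not from the original post):

from django.urls import include, path
from rest_framework import routers

from .views import WorldBorderViewSet

router = routers.DefaultRouter()
router.register(r'worldborders', WorldBorderViewSet)  # assumed URL prefix

urlpatterns = [
    path('api/', include(router.urls)),
]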
However, the most effective way to save space is to simplify the geometries, as described by geoAndrew. Here I compute the geometry simplification on the fly in the serializer:
from rest_framework_gis.serializers import (
    GeoFeatureModelSerializer, GeometrySerializerMethodField)
from .models import WorldBorder


class WorldBorderSerializer(GeoFeatureModelSerializer):
    # in order to simplify polygons on the fly
    simplified_mpoly = GeometrySerializerMethodField()

    def get_simplified_mpoly(self, obj):
        # Returns a new GEOSGeometry, simplified to the specified tolerance
        # using the Douglas-Peucker algorithm. A higher tolerance value implies
        # fewer points in the output. If no tolerance is provided, it
        # defaults to 0.
        return obj.mpoly.simplify(tolerance=0.01, preserve_topology=True)

    class Meta:
        model = WorldBorder
        geo_field = "simplified_mpoly"
        fields = (
            "id", "name", "area", "pop2005", "fips", "iso2", "iso3",
            "un", "region", "subregion", "lon", "lat",
        )
The two solutions are different and can't be merged (see how rest_framework_gis.fields is implemented). Simplifying the geometry is probably the better solution to preserve quality and save space. Hope it helps!
In Spark I want to save RDD objects to a Hive table. I am trying to use createDataFrame, but it throws:
Exception in thread "main" java.lang.NullPointerException
val products = sc.parallelize(evaluatedProducts.toList)
// here products is an RDD[Product]
val productdf = hiveContext.createDataFrame(products, classOf[Product])
I am using Spark 1.5 version.
If your Product is a class (not a case class), I suggest you transform your RDD to an RDD of tuples before creating the DataFrame:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
val productDF = products
  .map({ p: Product => (p.getVal1, p.getVal2, ...) })
  .toDF("col1", "col2", ...)
With this approach, you will have the Product attributes as columns in the DataFrame.
Then you can create a temp table with:
productDF.registerTempTable("table_name")
or a physical table with:
productDF.write.saveAsTable("table_name")