How do I save a Huggingface dataset? - huggingface-datasets

How do I write a HuggingFace dataset to disk?
I have made my own HuggingFace dataset using a JSONL file:
Dataset({
    features: ['id', 'text'],
    num_rows: 18
})
I would like to persist the dataset to disk.
Is there a preferred way to do this, or is the only option to use a general-purpose serialization library like joblib or pickle?

You can save a HuggingFace dataset to disk using the save_to_disk() method.
For example:
from datasets import load_dataset
test_dataset = load_dataset("json", data_files="test.json", split="train")
test_dataset.save_to_disk("test.hf")
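To read the saved dataset back later, the library provides load_from_disk(). A minimal sketch, reusing the "test.hf" path from the snippet above:
from datasets import load_from_disk

# Reload the dataset that was written with save_to_disk()
reloaded = load_from_disk("test.hf")
print(reloaded)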

You can also export the dataset to a specific format using the to_<format> methods, such as to_json(), to_csv(), or to_parquet(). See the following snippet as an example:
from datasets import load_dataset

dataset = load_dataset("squad")
for split, split_dataset in dataset.items():
    split_dataset.to_json(f"squad-{split}.jsonl")
For more information, see the official Hugging Face notebook: https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/videos/save_load_dataset.ipynb#scrollTo=8PZbm6QOAtGO
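A JSONL file exported this way can be loaded back with the same json loader shown in the first answer. A minimal sketch, assuming the squad-train.jsonl file produced above:
from datasets import load_dataset

# Load an exported JSONL split back into a Dataset
reloaded = load_dataset("json", data_files="squad-train.jsonl", split="train")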

Related

How to use tapas table question answer model when table size is big like containing 50000 rows?

I am trying to build a model in which I load a dataframe (an Excel file from Kaggle) and use the tapas-large-finetuned-wtq model to query this dataset. When I queried 259 rows (memory usage 62.9 KB) I had no problem, but when I queried 260 rows (memory usage 63.1 KB) I got the error "index out of range in self". I have attached a screenshot for reference as well. The data I used can be found in the Kaggle datasets.
The code I am using is:
from transformers import pipeline
import pandas as pd
import torch

# df = <the Kaggle dataframe loaded from the Excel file>
question = "Which Country code has the quantity 30604?"
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq")
c = tqa(table=df[:100], query=question)['cells']
In the last line, as you can see in the screenshot, I get the error.
Please let me know how I can work toward a solution. Any tips would be welcome.
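No answer is attached here, but one hedged workaround sketch: TAPAS is built on a BERT encoder that accepts at most 512 tokens, so a sufficiently large table overflows its position embeddings, and querying the table in fixed-size chunks keeps each call under the limit. The chunk size below is an assumption to tune, and df stands for the Kaggle dataframe from the question:
from transformers import pipeline

# df = <the Kaggle dataframe>; TAPAS expects every cell as a string,
# so df = df.astype(str) may also be needed
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq")
question = "Which Country code has the quantity 30604?"

chunk_size = 100  # assumption: small enough to stay under the 512-token limit
answers = []
for start in range(0, len(df), chunk_size):
    chunk = df[start:start + chunk_size].reset_index(drop=True)
    answers.append(tqa(table=chunk, query=question))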

Sorting data frame properly

I have the following data frame (called finaldf). However, the dates are all out of order. How would I be able to sort this data so that all of it is retained but the dates are in order?

I'm assuming you are using Python; if it's a pandas DataFrame, then to sort the data based on the 'date' column you can use the following command:
finaldf = finaldf.sort_values(by="date")
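One caveat worth adding: if the date column holds strings, sort_values() sorts them lexicographically rather than chronologically. A minimal sketch of converting first, using pandas' standard to_datetime():
import pandas as pd

# finaldf = <the dataframe from the question>
# Parse the strings into real datetimes so the sort is chronological
finaldf["date"] = pd.to_datetime(finaldf["date"])
finaldf = finaldf.sort_values(by="date").reset_index(drop=True)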

Azure Machine Learning Service - dataset API question

I am trying to use the AutoML feature of AML. I saw that the sample notebook uses Dataset.Tabular.from_delimited_files(train_data), which only takes data from an https path. I am wondering how I can use a pandas dataframe directly in the AutoML config instead of using the Dataset API. Alternatively, how can I convert a pandas dataframe to a tabular dataset to pass into the AutoML config?
You could quite easily save your pandas dataframe to parquet, upload the data to the workspace's default blob store, and then create a Dataset from there:
import os
from azureml.core.dataset import Dataset

# ws = <your AzureML workspace>
# df = <contains a pandas dataframe>

# Write the dataframe to a local parquet file
os.makedirs('mydata', exist_ok=True)
df.to_parquet('mydata/myfilename.parquet')

# Upload the folder to the workspace's default datastore
dataref = ws.get_default_datastore().upload('mydata')

# Create a TabularDataset from the uploaded parquet file
dataset = Dataset.Tabular.from_parquet_files(path=dataref.path('myfilename.parquet'))
dataset.to_pandas_dataframe()
Or you can just create the Dataset from local files in the portal at http://ml.azure.com.
Once you have created it in the portal, it will provide you with code to load it, which will look somewhat like this:
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset
subscription_id = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
resource_group = 'ignite'
workspace_name = 'ignite'
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='IBM-Employee-Attrition')
dataset.to_pandas_dataframe()
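To close the loop on the original question, a hedged sketch of passing such a TabularDataset into the AutoML configuration. AutoMLConfig with training_data and label_column_name comes from the azureml-train-automl package; the task type and label column here are assumptions:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task='classification',            # assumption: pick the task for your problem
    training_data=dataset,            # the TabularDataset created above
    label_column_name='Attrition',    # assumption: the target column in your data
)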

Importing shapefile data using GeoDataFrame

I am using GeoDataFrame to import the data, but I have the following problem. This function works well for some shapefiles, but it fails for some specific shapefiles and I am wondering why:
from geopandas import GeoDataFrame

data = GeoDataFrame.from_file('bayarea_general.shp')
fiona/ogrext.pyx in fiona.ogrext.Iterator.__next__ (fiona/ogrext.c:17244)()
fiona/ogrext.pyx in fiona.ogrext.FeatureBuilder.build (fiona/ogrext.c:3254)()
IndexError: list index out of range
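No answer is attached here, but a hedged diagnostic sketch: the traceback shows the IndexError is raised inside fiona while building a feature, so iterating the file directly with fiona can pinpoint how far reading gets before it fails (the offending record often has a malformed or null geometry). fiona.open() and iteration over the collection are standard fiona API; the file name is taken from the question:
import fiona

# Count how many features read cleanly before the feature builder fails
with fiona.open('bayarea_general.shp') as src:
    read_ok = 0
    try:
        for feature in src:
            read_ok += 1
    except IndexError:
        print(f"Reading failed at feature {read_ok} of {len(src)}")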

Percentile calculation in Pig Latin

I'm trying to calculate percentiles using Pig. I need to group the data by an attribute and calculate percentiles for each tuple in the group based on sales.
I've seen that there is no built-in Pig function to do this. I'm wondering whether anyone who has faced a similar problem before can help me.
As JaiPrakash mentioned, you can use the UDF StreamingQuantile from the Apache DataFu library. Since I already have an example ready, I'll just copy it here.
Input
item1,234
item1,324
item1,769
item2,23
item2,23
item2,45
PIG Script
register datafu-1.2.0.jar;
define Quantile datafu.pig.stats.StreamingQuantile('0.0','0.5','1.0');
data = load 'data' using PigStorage(',') as (item:chararray, value:int);
quantiles = FOREACH (GROUP data by item) GENERATE group, Quantile(data.value);
dump quantiles;
Output
(item1,(234.0,324.0,769.0))
(item2,(23.0,23.0,45.0))
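For comparison, a hedged sketch of the same per-group quantiles in pandas; this is not part of the original answer. groupby().quantile() is standard pandas API, though exact results can differ slightly from DataFu's StreamingQuantile, which computes approximate quantiles:
import pandas as pd

# The same six input rows as above
df = pd.DataFrame({
    "item": ["item1", "item1", "item1", "item2", "item2", "item2"],
    "value": [234, 324, 769, 23, 23, 45],
})

# Min, median, and max per item, mirroring StreamingQuantile('0.0','0.5','1.0')
quantiles = df.groupby("item")["value"].quantile([0.0, 0.5, 1.0])
print(quantiles)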
