Azure Machine Learning Service - dataset API question - automl

I am trying to use the AutoML feature of Azure Machine Learning (AML). I saw that the sample notebook uses Dataset.Tabular.from_delimited_files(train_data), which only takes data from an https path. I am wondering how I can pass a pandas dataframe directly to the AutoML config instead of using the Dataset API. Alternatively, how can I convert a pandas dataframe into a tabular dataset to pass into the AutoML config?

You could quite easily save your pandas dataframe to parquet, upload the data to the workspace's default blob store and then create a Dataset from there:
# ws = <your AzureML workspace>
# df = <contains a pandas dataframe>
import os
from azureml.core.dataset import Dataset

# write the dataframe to a local parquet file
os.makedirs('mydata', exist_ok=True)
df.to_parquet('mydata/myfilename.parquet')

# upload the folder to the workspace's default datastore
dataref = ws.get_default_datastore().upload('mydata')

# create a TabularDataset from the uploaded parquet file
dataset = Dataset.Tabular.from_parquet_files(path=dataref.path('myfilename.parquet'))
dataset.to_pandas_dataframe()
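Depending on your azureml-core version, there is also a one-call shortcut, Dataset.Tabular.register_pandas_dataframe, which writes the dataframe to the datastore and registers it for you. A minimal sketch (the dataset name is a placeholder, and the method is not available in older SDK versions):
from azureml.core import Dataset

# ws = <your AzureML workspace>, df = <a pandas dataframe>, as above
datastore = ws.get_default_datastore()
dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=df,
    target=datastore,            # where the backing parquet files are written
    name='my-training-data'      # placeholder dataset name
)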
Or you can just create the Dataset from local files in the portal at http://ml.azure.com
Once you have created it in the portal, it will provide you with the code to load it, which will look something like this:
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset
subscription_id = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
resource_group = 'ignite'
workspace_name = 'ignite'
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='IBM-Employee-Attrition')
dataset.to_pandas_dataframe()
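Either way, once you have a TabularDataset you can pass it straight to AutoML via the training_data parameter of AutoMLConfig. A rough sketch (the task type, label column and metric below are placeholders, not taken from the original question):
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task='classification',             # placeholder task type
    training_data=dataset,             # the TabularDataset created above
    label_column_name='target',        # placeholder label column name
    primary_metric='AUC_weighted'
)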

Related

How to use the TAPAS table question answering model when the table is big, e.g. containing 50000 rows?

I am trying to build a model in which I load a dataframe (an Excel file from Kaggle) and use the TAPAS-large-finetuned-wtq model to query this dataset. Querying 259 rows (memory usage 62.9 KB) works fine, but when I query 260 rows (memory usage 63.1 KB) I get the error: "index out of range in self". I have attached a screenshot for reference as well. The data I used can be found in the Kaggle datasets.
The code I am using is:
from transformers import pipeline
import pandas as pd
import torch

# df = <pandas dataframe loaded from the Kaggle Excel file>
question = "Which Country code has the quantity 30604?"
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq")
c = tqa(table=df[:100], query=question)['cells']
In the last line, as you can see in the screenshot, I get the error.
Please let me know how I can work towards a solution. Any tips would be welcome.
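One workaround that is often suggested for this kind of error (it is not from the original post) is to query the table in smaller chunks, since TAPAS models can only encode a limited number of rows and tokens per query. A rough sketch:
# split the dataframe into chunks small enough for the model and query each one
chunk_size = 100   # placeholder chunk size; tune it to stay within the model's limits
answers = []
for start in range(0, len(df), chunk_size):
    chunk = df[start:start + chunk_size].astype(str)   # TAPAS expects string cells
    answers.append(tqa(table=chunk, query=question))
Note that each chunk is queried independently, so questions whose answer depends on aggregating over the whole table will not be answered correctly this way.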

How to change the export file to plaintext format in RStudio (bibliometrix)?

library(bibliometrix)
library(xlsx)
# importing the Web of Science dataset
web_data <- convert2df("tw.txt")
# importing the Scopus dataset
scopus_data <- convert2df("ts.bib", dbsource = "scopus", format = "bibtex")
# combining both datasets
combined <- mergeDbSources(web_data, scopus_data, remove.duplicated = T)
# exporting the file
write.xlsx(combined, "combined.xlsx")
I need to export the combined data to plaintext format for VOSviewer analysis.

How do I save a Huggingface dataset?

How do I write a HuggingFace dataset to disk?
I have made my own HuggingFace dataset using a JSONL file:
Dataset({
    features: ['id', 'text'],
    num_rows: 18
})
I would like to persist the dataset to disk.
Is there a preferred way to do this? Or, is the only option to use a general purpose library like joblib or pickle?
You can save a HuggingFace dataset to disk using the save_to_disk() method.
For example:
from datasets import load_dataset
test_dataset = load_dataset("json", data_files="test.json", split="train")
test_dataset.save_to_disk("test.hf")
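To load it back later, the datasets library provides a matching load_from_disk function:
from datasets import load_from_disk

# reload the dataset that was written with save_to_disk()
test_dataset = load_from_disk("test.hf")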
Alternatively, you can save the dataset in another format using one of the to_* methods (for example to_json, to_csv or to_parquet). See the following snippet as an example:
from datasets import load_dataset
dataset = load_dataset("squad")
for split, split_dataset in dataset.items():
    split_dataset.to_json(f"squad-{split}.jsonl")
For more information, see the official Hugging Face notebook: https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/videos/save_load_dataset.ipynb#scrollTo=8PZbm6QOAtGO

How to load new LARGE product data in excel into Hybris?

The data is in CSV format and needs to be loaded into the SAP Hybris Commerce Suite. How do I create the product descriptions etc.? What steps do we need to take?
Solution 1:
Steps:
1. Create an import impex for Product like this:
$catalogVersion=catalogVersion(catalog(id[default=STORE-PRODUCTS]),version[default=Online])[unique=true]
$approvalStatus=approvalStatus(code)[default='approved']
$unit=unit(code)[default=pieces]
INSERT_UPDATE Product;code[unique=true,allownull=true];name[lang=en];ean;onlineDate[dateformat=yyyy/MM/dd];offlineDate[dateformat=yyyy/MM/dd];description[lang=ja];minOrderQuantity;galleryImages($catalogVersion,qualifier);picture($catalogVersion,code);thumbnail($catalogVersion,code);manufacturerAID;manufacturerName;variantType(code);$catalogVersion;$approvalStatus;$unit
Modify the impex as required.
2. Add the entry for your csv file in impex as below:
"#% impex.includeExternalDataMedia( ""Products.csv"" , ""UTF-8"", ';', 1 , -1 );"
3. Zip this impex and csv file together.
4. Use the Import functionality in HMC to import the zipped file.
Make sure your CSV file matches the column format required by the impex.
Solution 2:
The other, preferable solution is to use the Hot Folder functionality of Hybris. For more details, see this
Solution 3:
Use the IBM MDM tool to create an import profile and upload your CSV data. Once your CSV data is uploaded, you can create an export profile and export your data in impex file format. This impex can then be imported into Hybris through HMC or HAC.
Hope this helps!

Importing shapefile data using GeoDataFrame

I am using GeoDataFrame for importing data, but I have the following problem. The function works well for some shapefiles but fails for certain others, and I am wondering why:
from geopandas import GeoDataFrame

data = GeoDataFrame.from_file('bayarea_general.shp')
fiona/ogrext.pyx in fiona.ogrext.Iterator.__next__ (fiona/ogrext.c:17244)()
fiona/ogrext.pyx in fiona.ogrext.FeatureBuilder.build (fiona/ogrext.c:3254)()
IndexError: list index out of range
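One way to narrow down this kind of IndexError (not from the original post) is to iterate over the shapefile with fiona directly, skip the records that fail to build, and construct the GeoDataFrame from the surviving features. A minimal sketch, assuming the same file name; whether the iterator can continue past a broken record depends on the fiona version:
import fiona
from geopandas import GeoDataFrame

# read features one by one so the broken record(s) can be identified and skipped
good_features = []
with fiona.open('bayarea_general.shp') as src:
    print(src.schema)                          # the declared fields and geometry type
    features = iter(src)
    index = 0
    while True:
        try:
            good_features.append(next(features))
        except StopIteration:
            break
        except Exception as err:
            print(f"skipping feature {index}: {err}")
        index += 1

data = GeoDataFrame.from_features(good_features)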
