How to use tapas table question answer model when table size is big like containing 50000 rows? - huggingface-transformers

I am building a setup in which I load a dataframe (an Excel file from Kaggle) and use the TAPAS-large-finetuned-wtq model to query this dataset. Querying 259 rows (memory usage 62.9 KB) works without a problem, but when I query 260 rows (memory usage 63.1 KB) I get the error: "index out of range in self". I have attached a screenshot for reference as well. The data I used can be found in the Kaggle datasets.
The code I am using is:
from transformers import pipeline
import pandas as pd
import torch

# df is the Kaggle Excel file loaded earlier, e.g. with pd.read_excel(...)
question = "Which Country code has the quantity 30604?"
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq")
c = tqa(table=df[:100], query=question)['cells']
The error is raised on the last line, as shown in the screenshot.
What would be a workable way around this? Any tips would be welcome.
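The usual cause is that TAPAS is a BERT-sized model: the question plus the flattened table must fit into its 512-token input, and once the table grows past that limit the embedding lookup fails with "index out of range in self". A common workaround, sketched below under that assumption (the chunk size and the handling of per-chunk answers are not from the original post), is to split the dataframe into chunks, query each chunk separately, and collect the results:
from transformers import pipeline
import pandas as pd

tqa = pipeline(task="table-question-answering",
               model="google/tapas-large-finetuned-wtq")

def query_in_chunks(df, question, chunk_size=200):
    # Query each slice of the dataframe separately so that every slice stays
    # within the model's 512-token limit; collect the per-chunk results.
    answers = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].astype(str)  # TAPAS expects string cells
        try:
            answers.append(tqa(table=chunk, query=question))
        except Exception as err:  # a chunk with very wide rows can still overflow
            print(f"chunk starting at row {start} failed: {err}")
    return answers

# answers = query_in_chunks(df, "Which Country code has the quantity 30604?")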

Related

How do I save a Huggingface dataset?

How do I write a HuggingFace dataset to disk?
I have made my own HuggingFace dataset using a JSONL file:
Dataset({
    features: ['id', 'text'],
    num_rows: 18
})
I would like to persist the dataset to disk.
Is there a preferred way to do this? Or, is the only option to use a general purpose library like joblib or pickle?
You can save a HuggingFace dataset to disk using the save_to_disk() method.
For example:
from datasets import load_dataset
test_dataset = load_dataset("json", data_files="test.json", split="train")
test_dataset.save_to_disk("test.hf")
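As a quick usage note (not part of the original answer), the saved dataset can be loaded back with load_from_disk():
from datasets import load_from_disk

# Reload the dataset that was written with save_to_disk()
test_dataset = load_from_disk("test.hf")
print(test_dataset)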
You can also export the dataset to other formats with the to_<format> methods (to_csv, to_json, to_parquet, and so on). See the following snippet as an example:
from datasets import load_dataset

dataset = load_dataset("squad")
for split, split_dataset in dataset.items():
    split_dataset.to_json(f"squad-{split}.jsonl")
For more information, see the official Hugging Face notebook: https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/videos/save_load_dataset.ipynb#scrollTo=8PZbm6QOAtGO
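As a usage note (not from the original answer), an exported JSONL file can be loaded back into a Dataset by pointing load_dataset at it again:
from datasets import load_dataset

# Reload one of the exported splits from its JSONL file
train_dataset = load_dataset("json", data_files="squad-train.jsonl", split="train")
print(train_dataset)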

Parquet metadata reports "has_dictionary_page" as false but the column has "PLAIN_DICTIONARY" encoding

I used pyarrow's parquet module to read the metadata of a Parquet file with this code:
from pyarrow import parquet

p_file = parquet.ParquetFile("v-c000.gz.parquet")
for rg_idx in range(p_file.metadata.num_row_groups):
    rg = p_file.metadata.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        print(col)
In the output I got has_dictionary_page: False for every row group, but according to my checks all the column chunks in all row groups are PLAIN_DICTIONARY encoded. Furthermore, I checked the statistics of the dictionary and could see all the keys and values in it (part of the output is attached).
How is it possible that there is no dictionary page?
My best guess is that you are running into PARQUET-1547, which is described a bit more in this question.
In summary, some parquet writers did not write the dictionary_page_offset field correctly. Parquet readers have workarounds in place to recognize the invalid write. However, parquet-cpp (which is used by pyarrow) does not have such a workaround in place.
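(Not from the original answer.) If you want to see exactly what the metadata reports for a single column chunk, the same pyarrow API exposes the relevant fields; a minimal sketch using the file name from the question:
from pyarrow import parquet

p_file = parquet.ParquetFile("v-c000.gz.parquet")
col = p_file.metadata.row_group(0).column(0)

# Compare the flag with the actual encodings and the dictionary page offset;
# per PARQUET-1547 the offset can be written in a way parquet-cpp treats as invalid.
print(col.has_dictionary_page)      # False in the question
print(col.encodings)                # e.g. ('PLAIN_DICTIONARY', 'RLE', ...)
print(col.dictionary_page_offset)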

Can't seem to import Google Cloud Vertex AI Text Sentiment Analysis dataset

I am experimenting with Google Cloud Vertex AI Text Sentiment Analysis. I created a sentiment dataset based on the following reference:
https://cloud.google.com/vertex-ai/docs/datasets/prepare-text#sentiment-analysis
When I created the dataset, I specified a maximum sentiment of 1 to get a range of 0-1. The documentation indicates that the CSV file should have the following format:
[ml_use],gcs_file_uri|"inline_text",sentiment,sentimentMax
So I created a csv file with something like this:
My computer is not working.,0,1
You are really stupid.,1,1
As indicated in the documentation, I need at least 10 entries per sentiment value. I created 11 entries each for the values 0 and 1, 22 entries in total. I then uploaded the file and got "Unable to import data due to error", but the error message is blank. There don't appear to be any errors logged in the Log Explorer.
I tried importing a text classification dataset and it imported properly. The imported lines look something like this:
The flowers are very pretty,happy
The grass are dead,sad
What am I doing wrong here for the sentiment data?
OK, the issue appears to be character-set related. I had generated the CSV file using LibreOffice Calc and exported it as CSV. Out of the box it defaults to a Western European character set, which looked fine in my text editor but apparently caused problems. I changed it to UTF-8 and now it imports my dataset.
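If you generate the CSV from code rather than from LibreOffice Calc, here is a minimal sketch (not part of the original answer) that writes the inline-text rows above with an explicit UTF-8 encoding:
import csv

# Vertex AI sentiment rows in the "inline_text",sentiment,sentimentMax format,
# written with an explicit UTF-8 encoding to avoid the character-set issue above.
rows = [
    ("My computer is not working.", 0, 1),
    ("You are really stupid.", 1, 1),
]

with open("sentiment.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for text, sentiment, sentiment_max in rows:
        writer.writerow([text, sentiment, sentiment_max])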

Trying to scrape data off of dividendinvestor.com

I'm trying to import some stock data regarding dividend history using Google Sheets.
The data I'm trying to grab is from this page: https://www.dividendinvestor.com/dividend-quote/
(e.g. https://www.dividendinvestor.com/dividend-quote/ibm or https://www.dividendinvestor.com/dividend-quote/msft)
With other sites, I've been able to use a combination of INDEX and IMPORTHTML to get data from a table. For example, if I wanted to get the "Forward P/E" for IBM from finviz.com, I do this:
=index(IMPORTHTML("http://finviz.com/quote.ashx?t=IBM","table", 11),11,10)
That grabs table 11 and goes down 11 rows and over 10 columns to get the piece of data that I want.
However, I cannot seem to find any tables to import via IMPORTHTML from the www.dividendinvestor.com/dividend-quote/ibm site.
I'm trying to import the value to the right of the "Consecutive Dividend Increases" field.
In this case, the output I'm trying to achieve is "19 years".
I've also tried IMPORTXML, but everything I try with XPATH (using this path: "/html/body/div[3]/div/div/div[2]/div/div/div[2]/div[2]/div[2]/span[20]" ) fails too.
Any help out there? The desired end result is to dynamically build the dividendinvestor.com URL by appending a different ticker symbol and get back the number of years of consecutive increases in the dividend payout.
Nice solution proposed by @player0. If you don't want to use INDEX, you can go with:
=IMPORTXML("https://www.dividendinvestor.com/dividend-quote/"&B3,"//a[.='Consecutive Dividend Increases']/following::span[1]")
Update (May 2022) :
New working formula :
=REGEXEXTRACT(TEXTJOIN("|";TRUE;IMPORTXML("https://www.dividendinvestor.com/ajax/?action=quote_ajax&symbol="&B2;"//text()"));"\d+ Years")
Note : I'm based in Europe, so semi-colons may have to be replaced with commas.
try:
=INDEX(IMPORTXML("https://www.dividendinvestor.com/dividend-quote/ibm/",
 "//span[@class = 'data']"), 9, 1)

Reading XML-files with StAX / Kettle (Pentaho)

I'm doing an ETL process with Pentaho (Spoon / Kettle) where I'd like to read an XML file and store element values to a DB.
This works just fine with the "Get data from XML" component... but the XML file is quite big, several gigabytes, and therefore reading the file takes too long.
Pentaho Wiki says:
The existing Get Data from XML step is easier to use but uses DOM parsers that need in-memory processing, and even the purging of parts of the file is not sufficient when these parts are very big. The XML Input Stream (StAX) step uses a completely different approach to solve use cases with very big and complex data structures and the need for very fast data loads...
Therefore I'm now trying to do the same with StAX, but it just doesn't seem to work out as planned. I'm testing this with an XML file which has only one element group. The file is read and then mapped/inserted into the table... but now I get multiple rows in the table where all the values are "undefined", and some rows where I have the right values. In total I have 92 rows in the table, even though it should only have one row.
The flow goes like this:
1) read with StAX
2) Modified Java Script Value
3) Output to DB
At step 2) I'm doing the following:
var id;
if ( xml_data_type_description.equals("CHARACTERS") &&
     xml_path.equals("/labels/label/id") ) {
    id = xml_data_value;
}
...
I'm using positional-staz.zip from http://forums.pentaho.com/showthread.php?83480-XPath-in-Get-data-from-XML-tool&p=261230#post261230 as an example.
How do I use StAX to read an XML file and store the element values to a DB?
I've been trying to look for examples but haven't found much. The above example uses a "Filter Rows" component before inserting the rows. I don't quite understand why it's being used; can't I just map the values I need? It might be that this problem occurs because I don't use, or don't know how to use, the Filter Rows component.
Cheers!
I posted a possible StAX-based solution on the forum listed above, but I'll post the gist of it here since it is awaiting moderator approval.
Using the StAX parser, you can select just those elements that you care about, namely those with a data type of CHARACTERS. For the forum example, you basically need to denormalize the rows in sets of 4 (EXPR, EXCH, DATE, ASK). To do this you add the row number to the stream (using an Add Sequence step) then use a Calculator to determine a "bucket number" = INT((rownum-1)/4). This will give you a grouping field for a Row Denormaliser step.
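Purely as an illustration (not part of the original answer), the bucketing/denormalising idea can be sketched in Python with made-up sample values, to show how every four CHARACTERS values collapse into one output row:
# Hypothetical sketch of the "bucket number" denormalisation described above:
# every 4 consecutive values (EXPR, EXCH, DATE, ASK) become one output row.
values = ["EUR/USD", "NYSE", "2014-01-01", "1.36",
          "EUR/GBP", "LSE", "2014-01-01", "0.83"]

rows = []
for rownum, value in enumerate(values, start=1):
    bucket = (rownum - 1) // 4                        # Calculator: INT((rownum-1)/4)
    field = ("EXPR", "EXCH", "DATE", "ASK")[(rownum - 1) % 4]
    if field == "EXPR":
        rows.append({})                               # start a new denormalised row
    rows[bucket][field] = value                       # Row Denormaliser step
print(rows)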
When the post is approved, you'll see a link to a transformation that uses StAX and the method I describe above.
Is this what you're looking for? If not please let me know where I misunderstood and maybe I can help.

Resources