Huggingface reformer for long document summarization - huggingface-transformers

I understand Reformer is able to handle a large number of tokens. However, it does not appear to support the summarization task:
>>> from transformers import ReformerTokenizer, ReformerModel
>>> from transformers import pipeline
>>> summarizer = pipeline("summarization", model="reformer")
404 Client Error: Not Found for url: https://huggingface.co/reformer/resolve/main/config.json
...
How would you construct the pipeline "manually" to use reformer for summarization?

Try this:
summarizer = pipeline("summarization", model="google/reformer-enwik8")
However, this produces the following error:
/lib/python3.7/site-packages/sentencepiece.py", line 177, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
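The "not a string" error appears to come from the fact that google/reformer-enwik8 is a character-level checkpoint that ships no SentencePiece file, so ReformerTokenizer receives None instead of a vocab path and fails inside sentencepiece. A pipeline can be wired up manually by passing model and tokenizer objects instead of a hub id; the sketch below uses google/reformer-crime-and-punishment only because that checkpoint does include a SentencePiece tokenizer. Treat this as an assumption-laden sketch: no public Reformer checkpoint is fine-tuned for summarization, so the pipeline may warn about the model type and the generated text will not be a meaningful summary.
from transformers import ReformerTokenizer, ReformerModelWithLMHead, pipeline

model_name = "google/reformer-crime-and-punishment"  # assumption: chosen only because it ships a SentencePiece tokenizer
tokenizer = ReformerTokenizer.from_pretrained(model_name)
model = ReformerModelWithLMHead.from_pretrained(model_name)

# Pass the objects themselves instead of a hub id string.
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
print(summarizer("A very long document ...", max_length=64))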

Related

Streamlit Unhashable TypeError when I use st.cache

When I use the st.cache decorator to cache a Hugging Face transformers model, I get an
UnhashableTypeError.
This is the code:
from transformers import pipeline
import streamlit as st
from io import StringIO

@st.cache(hash_funcs={StringIO: StringIO.getvalue})
def model():
    return pipeline("sentiment-analysis", model='akhooli/xlm-r-large-arabic-sent')
After searching in the issues section of the Streamlit repo,
I found that the hashing argument is not required; you just need to pass this argument:
allow_output_mutation=True
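For reference, the allow_output_mutation approach mentioned above looks roughly like this (a minimal sketch; the model name is just carried over from the code above):
from transformers import pipeline
import streamlit as st

# Skip hashing of the returned pipeline object entirely; Streamlit then
# reruns the function only when its (here: zero) inputs change.
@st.cache(allow_output_mutation=True)
def model():
    return pipeline("sentiment-analysis", model='akhooli/xlm-r-large-arabic-sent')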
This worked for me:
from transformers import pipeline
import tokenizers
import streamlit as st
import copy

@st.cache(hash_funcs={tokenizers.Tokenizer: lambda _: None, tokenizers.AddedToken: lambda _: None})
def get_model():
    return pipeline("sentiment-analysis", model='akhooli/xlm-r-large-arabic-sent')

input = st.text_input('Text')
bt = st.button("Get Sentiment Analysis")
if bt and input:
    model = copy.deepcopy(get_model())
    st.write(model(input))
Note 1:
Calling the pipeline with model(input) mutates the model, and we shouldn't change a cached value, so we need to copy the model and run the copy.
Note 2:
The first run will load the model via the get_model function; subsequent runs will use the cache.
Note 3:
You can read more about advanced caching in Streamlit in their documentation.

Why is the output file from Biopython not found?

I work with a Mac. I have been trying to make a multiple sequence alignment in Python using Muscle. This is the code I have been running:
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="testunaligned.fasta", out="testunaligned.aln", clwstrict=True)
print(cline)
from Bio import AlignIO
align = AlignIO.read(open("testunaligned.aln"), "clustal")
print(align)
I keep getting the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'testunaligned.aln'
Does anyone know how I could fix this? I am very new to Python and computer science in general, and I am totally at a loss. Thanks!
cline in your code is an instance of the MuscleCommandline class that you initialized with all the parameters. After the initialization, this instance can run muscle, but it will only do that if you call it. That means you have to invoke cline().
When you simply print the cline object, it returns the string corresponding to the command you could run manually on the command line to get the same result as invoking cline().
And here is the working code:
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(
    input="testunaligned.fasta",
    out="testunaligned.aln",
    clwstrict=True
)
print(cline)
cline()  # this is where muscle actually runs
from Bio import AlignIO
align = AlignIO.read(open("testunaligned.aln"), "clustal")
print(align)
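If you also want to inspect what Muscle printed while running, Biopython's command-line wrappers return the captured output when invoked; a small sketch under that assumption:
stdout, stderr = cline()  # the wrapper returns the tool's stdout and stderr as strings
print(stderr)             # Muscle writes its progress messages to stderr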

How to correct TypeError with choices() missing 1 required positional argument: 'population'

I want a list of entries to display when calling the random page function, but I keep getting this error: choices() missing 1 required positional argument: 'population'.
I had this problem due to a typo recently.
Apparently you are using the recommended 'secrets' CSPRNG and not the 'random' pseudo-RNG. I had the following code:
#!/usr/bin/python3
"""
Will get 10 random letters...
from the lowercase abcdefghijklmnopqrstuvwxyz
"""
import string
import secrets
response = secrets.SystemRandom.choices(string.ascii_lowercase, k=10)
print(response)
But secrets.SystemRandom is a class, so I just changed it to secrets.SystemRandom() and the issue was fixed.
Fixed code:
#!/usr/bin/python3
"""
Will get 10 random letters...
from the lowercase abcdefghijklmnopqrstuvwxyz
"""
import string
import secrets
response = secrets.SystemRandom().choices(string.ascii_lowercase, k=10)
print(response)
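The error mentions 'population' rather than 'self' because the first positional argument (string.ascii_lowercase) gets bound to self when choices is called on the class itself, so population ends up missing. A small illustration of the difference (random.choices works at module level because the random module exposes functions already bound to a shared hidden instance):
import string
import random
import secrets

# Module-level function: already bound to a shared Random() instance.
print(random.choices(string.ascii_lowercase, k=10))

# Class attribute: needs an instance first, otherwise the population string is taken as 'self'.
print(secrets.SystemRandom().choices(string.ascii_lowercase, k=10))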

Pydotplus, Graphviz error: Program terminated with status: 1. stderr follows: 'C:\Users\En' is not recognized as an internal or external command

from pydotplus import graph_from_dot_data
from sklearn.tree import export_graphviz
from IPython.display import Image
dot_data = export_graphviz(tree,filled=True,rounded=True,class_names=['Setosa','Versicolor','Virginica'],feature_names=['petal length','petal width'],out_file=None)
graph = graph_from_dot_data(dot_data)
Image(graph.create_png())
Program terminated with status:
1. stderr follows: 'C:\Users\En' is not recognized as an internal or external command,
operable program or batch file.
It seems that it split my username in half. How do I overcome this?
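A frequently suggested explanation for this class of error is that the path to the Graphviz executables contains a space (here the username gets cut at 'C:\Users\En'), so either installing Graphviz into a path without spaces or putting its bin directory on PATH tends to help. A minimal sketch, assuming a default Windows install location (adjust the path to your actual install):
import os

# Assumption: Graphviz lives in the default install directory; change this to your path.
os.environ["PATH"] += os.pathsep + r"C:\Program Files\Graphviz\bin"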
I have a very similar example that I'm trying out; it's based on an ML how-to book that works with a Taiwan credit card dataset to predict default risk. My setup is as follows:
from six import StringIO
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus
Then creating the decision tree plot is done in this way:
dot_data = StringIO()
export_graphviz(decision_tree=class_tree,
                out_file=dot_data,
                filled=True,
                rounded=True,
                feature_names=X_train.columns,
                class_names=['pay', 'default'],
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
I think it's all coming from the out_file=dot_data argument, but I cannot figure out where the file path is created and stored, since print(dot_data.getvalue()) did not show any path name.
In my research I came across sklearn.tree.plot_tree(), which seems to do everything the graphviz route does. So I took the export_graphviz arguments above and, wherever a matching argument existed in the plot_tree method, added it.
I ended up with the following, which created the same image as the one in the text:
import matplotlib.pyplot as plt
from sklearn import tree

plt.figure(figsize=(20, 10))
tree.plot_tree(class_tree,
               filled=True, rounded=True,
               feature_names=X_train.columns,
               class_names=['pay', 'default'],
               fontsize=12)
plt.show()

Vision API: How to get JSON-output

I'm having trouble saving the output given by the Google Vision API. I'm using Python and testing with a demo image. I get the following error:
TypeError: [mid:...] is not JSON serializable
Code that I executed:
import io
import os
import json

# Imports the Google Cloud client library
from google.cloud import vision
from google.cloud.vision import types

# Instantiates a client
vision_client = vision.ImageAnnotatorClient()

# The name of the image file to annotate
file_name = os.path.join(
    os.path.dirname(__file__),
    'demo-image.jpg')  # Your image path from current directory

# Loads the image into memory
with io.open(file_name, 'rb') as image_file:
    content = image_file.read()

image = types.Image(content=content)

# Performs label detection on the image file
response = vision_client.label_detection(image=image)
labels = response.label_annotations

print('Labels:')
for label in labels:
    print(label.description, label.score, label.mid)

with open('labels.json', 'w') as fp:
    json.dump(labels, fp)
The output appears on the screen; however, I do not know exactly how I can save it. Does anyone have any suggestions?
FYI to anyone seeing this in the future, google-cloud-vision 2.0.0 has switched to using proto-plus which uses different serialization/deserialization code. A possible error you can get if upgrading to 2.0.0 without changing the code is:
object has no attribute 'DESCRIPTOR'
Using google-cloud-vision 2.0.0, protobuf 3.13.0, here is an example of how to serialize and de-serialize (example includes json and protobuf)
import io, json
from google.cloud import vision_v1
from google.cloud.vision_v1 import AnnotateImageResponse

with io.open('000048.jpg', 'rb') as image_file:
    content = image_file.read()

image = vision_v1.Image(content=content)
client = vision_v1.ImageAnnotatorClient()
response = client.document_text_detection(image=image)

# serialize / deserialize proto (binary)
serialized_proto_plus = AnnotateImageResponse.serialize(response)
response = AnnotateImageResponse.deserialize(serialized_proto_plus)
print(response.full_text_annotation.text)

# serialize / deserialize json
response_json = AnnotateImageResponse.to_json(response)
response = json.loads(response_json)
print(response['fullTextAnnotation']['text'])
Note 1: proto-plus doesn't support converting to snake_case names, which is supported in protobuf with preserving_proto_field_name=True. So currently there is no way around the field names being converted from response['full_text_annotation'] to response['fullTextAnnotation']
There is a (now closed) feature request for this: googleapis/proto-plus-python#109
Note 2: The Google Vision API doesn't return an x coordinate if x=0. If x doesn't exist, the protobuf will default it to x=0. In python vision 1.0.0, using MessageToJson(), these x values weren't included in the JSON, but now with python vision 2.0.0 and .to_json() these values are included as x:0.
Maybe you were already able to find a solution to your issue (if that is the case, I invite you to share it as an answer to your own post too), but in any case, let me share some notes that may be useful for other users with a similar issue:
As you can check using the type() function in Python, response is an object of google.cloud.vision_v1.types.AnnotateImageResponse type, while labels[i] is an object of google.cloud.vision_v1.types.EntityAnnotation type. Neither seems to have any out-of-the-box implementation to transform them to JSON, as you are trying to do, so I believe the easiest way to transform each EntityAnnotation in labels would be to turn them into Python dictionaries, then group them all into an array, and transform that into JSON.
To do so, I have added some simple lines of code to your snippet:
[...]
label_dicts = []  # Array that will contain all the EntityAnnotation dictionaries

print('Labels:')
for label in labels:
    # Write each label (EntityAnnotation) into a dictionary
    dict = {'description': label.description, 'score': label.score, 'mid': label.mid}
    # Populate the array
    label_dicts.append(dict)

with open('labels.json', 'w') as fp:
    json.dump(label_dicts, fp)
There is a library released by Google for this:
from google.protobuf.json_format import MessageToJson
webdetect = vision_client.web_detection(blob_source)
jsonObj = MessageToJson(webdetect)
I was able to save the output with the following function:
# Save output as JSON
def store_json(json_input):
    with open(json_file_name, 'a') as f:
        f.write(json_input + '\n')
And as @dsesto mentioned, I had to define a dictionary. In this dictionary I defined what types of information I would like to save in my output.
with open(photo_file, 'rb') as image:
    image_content = base64.b64encode(image.read())
    service_request = service.images().annotate(
        body={
            'requests': [{
                'image': {
                    'content': image_content
                },
                'features': [{
                    'type': 'LABEL_DETECTION',
                    'maxResults': 20,
                }, {
                    'type': 'TEXT_DETECTION',
                    'maxResults': 20,
                }, {
                    'type': 'WEB_DETECTION',
                    'maxResults': 20,
                }]
            }]
        })
The objects in the current Vision library lack serialization functions (although this is a good idea).
It is worth noting that they are about to release a substantially different library for Vision (it is on master of vision's repo now, although not released to PyPI yet) where this will be possible. Note that it is a backwards-incompatible upgrade, so there will be some (hopefully not too much) conversion effort.
That library returns plain protobuf objects, which can be serialized to JSON using:
from google.protobuf.json_format import MessageToJson
serialized = MessageToJson(original)
You can also use something like protobuf3-to-dict.
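If it helps, here is a rough sketch of how that package is typically used on the plain protobuf objects mentioned above (assumption: the protobuf3-to-dict package exposes a protobuf_to_dict function; webdetect is the variable from the earlier snippet in this thread):
import json
from protobuf_to_dict import protobuf_to_dict  # assumption: provided by the protobuf3-to-dict package

webdetect_dict = protobuf_to_dict(webdetect)  # plain Python dict
with open('webdetect.json', 'w') as fp:
    json.dump(webdetect_dict, fp)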
