How to parse pdf in Ruby - ruby

I have been trying to write a simple Ruby program to parse a simple PDF file and extract the text I am interested in. I found that pdf-reader is quite a good gem for PDF parsing. I have read through the examples given in that gem and some tutorials about it.
I have tried the callback method and was able to get all the text from my PDF file. But I did not understand the concept behind the arguments for some of the callbacks.
For example, suppose my PDF has a simple table with 3 columns and 2 rows: the header row values are (Name, Address, Age) and the first row values are (Arun, Hoskote, 22). When I run the following Ruby script:
require 'pdf/reader'

receiver = PDF::Reader::RegisterReceiver.new
reader = PDF::Reader.new("Arun.pdf")
reader.pages.each do |page|
  page.walk(receiver)
  receiver.callbacks.each do |cb|
    puts cb.inspect
  end
end
It prints a series of callbacks. Among them, the interesting show_text_with_positioning callbacks look like this:
{:name=>:show_text_with_positioning, :args=>[["N", 5, "am", -4, "e"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Ad", 6, "d", 3, "ress"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Age"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Ar", 4, "u", 3, "n"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["H", 3, "o", -5, "sk", 9, "o", -5, "te"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["22"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
From the above callbacks, what does args represent with respect to the PDF file? If I want to extract only the name value, 'Arun' in this example (anything can appear there), or the age value, i.e. '22' in this example (any value can appear there), how can I do that in a Ruby program? Is there any pdf-parser API or Ruby API to get only the value(s) I am interested in from a PDF file?
How can I write a Ruby program to access the particular callback that gives me the text I want?

If you only want the text, you can do something like this (but probably using a different stream as the destination for the text):
receiver = PDF::Reader::TextReceiver.new($stdout)
PDF::Reader.file("Arun.pdf", receiver)
Once you have the text, you could use regular expressions or whatever to get the specific value you want out of it.
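As for what args represents: in a show_text_with_positioning callback, the strings are the text fragments to draw and the numbers between them are position (kerning) adjustments, in thousandths of a text space unit, so only the strings carry text. A minimal sketch of rebuilding the cell values from the callbacks printed in the question follows. The helper name table_cells and the assumption that the table has exactly 3 columns are mine, not part of pdf-reader.

```ruby
# Hypothetical helper (not part of pdf-reader): join the string fragments of
# each show_text_with_positioning callback and drop whitespace-only results.
def table_cells(callbacks)
  callbacks
    .select { |cb| cb[:name] == :show_text_with_positioning }
    .map { |cb| cb[:args].first.grep(String).join } # numbers are kerning, skip them
    .reject { |text| text.strip.empty? }
end

# A few of the callbacks from the question, as RegisterReceiver reported them.
callbacks = [
  { name: :show_text_with_positioning, args: [["N", 5, "am", -4, "e"]] },
  { name: :show_text_with_positioning, args: [[" "]] },
  { name: :show_text_with_positioning, args: [["Ad", 6, "d", 3, "ress"]] },
  { name: :show_text_with_positioning, args: [["Age"]] },
  { name: :show_text_with_positioning, args: [["Ar", 4, "u", 3, "n"]] },
  { name: :show_text_with_positioning, args: [["H", 3, "o", -5, "sk", 9, "o", -5, "te"]] },
  { name: :show_text_with_positioning, args: [["22"]] }
]

# Assumption: 3 columns, so the first 3 cells are headers and the next 3 a row.
headers, values = table_cells(callbacks).each_slice(3).to_a
row = headers.zip(values).to_h
puts row["Name"] # the name cell
puts row["Age"]  # the age cell
```

This relies on the cells being drawn in reading order, which happens to be true for the output shown above but is not guaranteed for every PDF.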

Related

Using Huggingface Transformers and Tokenizers with a fixed vocabulary?

I have a special, non-language use case using a fixed vocabulary—i.e., a relatively small set of generated tokens that represent the entire vocabulary of our "language." I’d like to be able to use this with any of the different models and I’m wondering what would be the best approach? It’s just a vocab.txt file of short strings, which I don’t think will work with any of the BPE tokenizers. Am I correct in that assumption? Also, is there a way to “force” a vocabulary onto any of the tokenizers?
To clarify, our "language" uses prefixes to identify certain types of tokens, which have certain functions in the overall syntax. We want to be able to mask by type during inference, both on input and as part of the selection process, for example, by limiting top-k or top-p sampling to a given type. With a fixed/hand-tuned vocabulary we can be very specific about which ids, or how many ids we need; i.e., we know which tokens are used by each type, so we can mask/filter accordingly. However, with BPE tokenization a given type may be tokenized with any number of tokens, making this process much less straightforward.
The motivation is just to make life easier by fitting into the Huggingface universe a little better, so we can experiment with off-the-shelf models more fluently. We already have this working using the standard BertTokenizer with both GPT2 and RoBERTa, but it would be nice to be able to experiment with different Huggingface models "out of the box," so to speak (using Trainers, Pipelines, etc.). With the BertTokenizer we just load our vocab.txt and we're done, so I wondered whether there would be some way of doing this with the other tokenizers (really, the BPE ones are the only issue, at this point).
It seems to me that being able to specify a vocab for any tokenizer would be more straightforward than getting our tokenizer working with other models. Though perhaps a better approach would be to look at streamlining that process? I suppose I could fork and modify AutoTokenizer... ??
Any help much appreciated.
As far as I understand, the solution below might help you, since you can use this tokenizer like you would the other pre-trained ones.
As I do not really understand all the inner workings of the tokenizer, I may very well be off with this solution, but hopefully it can help someone.
The main idea is to subclass PreTrainedTokenizer. This way, you only need to override some of the key methods, like _tokenize, _convert_token_to_id, etc., which is more straightforward than implementing a whole new tokenizer.
import json
from pathlib import Path
from typing import Optional, Tuple, Dict

from transformers import PreTrainedTokenizer


class FixedVocabTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab: Dict[str, int], max_len: Optional[int] = None):
        # Store the vocabulary before calling the base initializer, since some
        # transformers versions query the vocabulary during __init__.
        self.__token_ids = vocab
        self.__id_tokens: Dict[int, str] = {value: key for key, value in vocab.items()}
        super().__init__(max_len=max_len)

    def _tokenize(self, text: str, **kwargs):
        # Tokens are separated by single spaces.
        return text.split(' ')

    def _convert_token_to_id(self, token: str) -> int:
        return self.__token_ids[token] if token in self.__token_ids else self.unk_token_id

    def _convert_id_to_token(self, index: int) -> str:
        return self.__id_tokens[index] if index in self.__id_tokens else self.unk_token

    def get_vocab(self) -> Dict[str, int]:
        return self.__token_ids.copy()

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        if filename_prefix is None:
            filename_prefix = ''
        vocab_path = Path(save_directory, filename_prefix + 'vocab.json')
        with open(vocab_path, 'w') as f:
            json.dump(self.__token_ids, f)
        return str(vocab_path),

    @property
    def vocab_size(self) -> int:
        return len(self.__token_ids)
if __name__ == '__main__':
    # your custom, fixed vocabulary
    custom_vocab = {
        '[UNK]': 0,
        'word0': 1,
        'word1': 2,
        'word2': 3,
        'word3': 4,
        'word4': 5,
        'word5': 6,
        '[CLS]': 7,
        '[SEP]': 8,
        '[PAD]': 9
    }
    model_max_len = 8
    tokenizer = FixedVocabTokenizer(custom_vocab, max_len=model_max_len)
    # tell your tokenizer about your special tokens
    tokenizer.add_special_tokens({
        'unk_token': '[UNK]',
        'pad_token': '[PAD]',
        'cls_token': '[CLS]',
        'sep_token': '[SEP]'
    })
    res = tokenizer(
        [
            'word1 word2 word word1 word3',
            'word2 word0 word0 word3 word5 word4 word2 word1 word0'
        ],
        padding=True,
        truncation=True
    )
    # the result should look something like this
    # res -> BatchEncoding(
    #     data: {
    #         'input_ids': [[2, 3, 0, 2, 4, 9, 9, 9], [3, 1, 1, 4, 6, 5, 3, 2]],
    #         'attention_mask': [[1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]],
    #         ...
    #     },
    #     ...
    # )
This is the solution I could come up with; however, I could not figure out whether you can do something similar with PreTrainedTokenizerFast. So one more note: this method only works with slow tokenizers.

How to create a set in Terraform

I am trying to create a set to use as an argument to the setproduct function in Terraform. When I try:
toset([a,b,c])
I get an error saying I can't convert a tuple to a list. I've tried various things like using tolist and ... and just having one pair of () braces in various places but I still can't get this to work - would anyone know how I can create a set from a,b and c?
A set must have elements of the same type. Thus it can be:
# set of strings
toset(["a", "b", "c"])
# set of numbers
toset([1, 2, 3])
# set of lists
toset([["b"], ["c", 4], [3,3]])
You can't mix types, so the error you are getting is because you are mixing types, e.g. list and number:
# will not work because different types
toset([["b"], ["c", 4], 3])

Setting order method in Sicstus prolog Samsort

I am trying to sort a list of lists such as Books=[[5,1,science,24,3,2018],[6,1,math,24,3,2019],[4,2,science,24,5,2019],[6,2,science,23,3,2019],[3,1,math,24,3,2020]]. I want to order this list based on the 5th value of each element. I tried to use
samsort(sortDateBooks, Books, Output).
sortDateBooks(Book1,Book2):-nth0(5,Book1, Date1),nth0(5,Book2, Date2), Date1<Date2.
The output variable is never filled with data and the original list is also not changed.
I feel that I am not declaring the order predicate properly but can't find any examples.
Thank you for your help.
Well, I noticed I had forgotten to import the samsort library, and because of the way it is used no error was shown. Many thanks to @Reema Q Khan, who provided a very useful workaround and an easy explanation.
I am not sure if this is what you want to do, if yes, then this may give you some hints:
1. Here collectdates acts like findall. It searches for all the years and puts them in a list, e.g. [2019,2018,2019,2019,2020].
2. The sortBook(Sorted) predicate first finds all the years using the collectdates predicate, then sorts them. Notice that in sort I've used @=<, which does not remove repeated values. You will get [2018,2019,2019,2019,2020].
3. The s predicate simply takes each year, searches for the matching information and puts it in a list.
The s predicate takes each year and checks it against every book, so this may produce extra entries. append is used to remove the extra level of brackets, and the set predicate simply removes duplicates, if any.
sortBook(Sorted) :-
    Book = [[6,2,science,23,3,2019],[5,1,science,24,3,2018],[6,1,math,24,3,2019],[4,2,science,24,5,2019],
            [3,1,math,24,3,2020]],
    collectdates(Book, Clist),
    sort(0, @=<, Clist, SList),
    s(SList, Book, Sorted1),
    append(Sorted1, Sorted2),
    set(Sorted2, Sorted).

collectdates([], []).
collectdates([H|T], [Last|List]) :-
    last(H, Last),
    collectdates(T, List).

s([], _, []).
s([H|T], [B|L], [W|List]) :-
    sortBook1(H, [B|L], W),
    s(T, [B|L], List).

sortBook1(_, [], []).
sortBook1(H, [B|L], [B|List]) :-
    member(H, B),
    sortBook1(H, L, List).
sortBook1(H, [B|L], List) :-
    \+ member(H, B),
    sortBook1(H, L, List).

set([], []).
set([H|T], [H|T2]) :-
    subtract(T, [H], T3),
    set(T3, T2).
Example:
?-sortBook(Sorted).
Sorted = [[5, 1, science, 24, 3, 2018], [6, 2, science, 23, 3, 2019], [6, 1, math, 24, 3, 2019], [4, 2, science, 24, 5, 2019], [3, 1, math, 24, 3, 2020]]
false

Sort Ruby Middleman array into 3 columns, with content placed in order to display LTR

I am using middleman and have a specific structure that I want my articles to fall into:
.row>
  .col-1>article 1, article 4...
  .col-2>article 2, article 5...
  .col-3>article 3, article 6...
The goal is that the articles read left-to-right but are stacked in their columns, so that no additional row class is needed. I have these articles enclosed in a visual card and want to stack them, which is why I have this strange problem to solve.
My question is: what is the cleanest Ruby way to sort the array into this format? The goal is to use it in a structure like this and be able to account for any number of columns, n.
.row
  - page_articles.someProcess(n).each_split(n) do | col |
    .column-class
      - col.each do | art |
        ...
Verbatim from group_by doc:
(1..6).group_by{ |i| i%3 }
#=> {0=>[3, 6], 1=>[1, 4], 2=>[2, 5]}
Specifically, use the flattened values:
page_articles.group_by.with_index{ |a,i| i%3 }.values.flatten(1)
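For any number of columns n, the same group_by idea can be wrapped in a small helper. This is only a sketch: the name columns_ltr is made up for illustration, and it returns the columns as separate arrays (without the final flatten) so that each one can be rendered in its own column container.

```ruby
# Hypothetical helper: distribute items over n columns so that reading the
# columns side by side, row by row, gives the original left-to-right order.
def columns_ltr(items, n)
  # Group by position modulo n: indexes 0, n, 2n, ... go to the first column,
  # indexes 1, n + 1, ... to the second, and so on.
  items.group_by.with_index { |_item, i| i % n }.values
end

articles = ["article 1", "article 2", "article 3",
            "article 4", "article 5", "article 6"]

columns_ltr(articles, 3).each_with_index do |col, i|
  puts "col-#{i + 1}: #{col.join(', ')}"
end
```

In a template, the outer loop would emit one column element per returned array and the inner loop one card per article.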

Mocking Sort With Mocha

How can I mock an array's sort so that it expects a lambda expression?
This is a trivial example of my problem:
# initializing the data
l = lambda { |a,b| a <=> b }
array = [ 1, 2, 3, 4, 5 ]
sorted_array = [ 2, 3, 8, 9, 1]
# I expect that sort will be called using the lambda as a parameter
array.expects(:sort).with( l ).returns( sorted_array )
# perform the sort using the lambda expression
temp = array.sort{|a,b| l.call(a,b) }
Now, at first I expected that this would work; however, I got the following error:
- expected exactly once, not yet invoked: [ 1, 2, 3, 4, 5 ].sort(#<Proc:0xb665eb48>)
I realize that this will not work because l is not passed as a parameter to sort. However, is there another way to do what this code is trying to accomplish?
NOTE: I have figured out how to solve my issue without figuring out how to do the above. I will leave this open just in case someone else has a similar problem.
Cheers,
Joseph
Mocking methods with blocks can be quite confusing. One of the keys is to be clear about what behaviour you want to test. I can't tell from your sample code exactly what it is that you want to test. However, you might find the documentation for Mocha::Expectation#yields (or even Mocha::Expectation#multiple_yields) useful.
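If the behaviour you want to check is "sort is called with this comparison as its block", one option that does not depend on Mocha at all is a hand-rolled stub that records the block it receives. The sketch below is plain Ruby; stub_sort is a hypothetical helper name, and this swaps in a different technique rather than showing how Mocha itself would do it.

```ruby
# Hypothetical helper: replace sort on one array with a stub that records
# every block it is called with and returns a canned result.
def stub_sort(array, canned_result)
  calls = []
  array.define_singleton_method(:sort) do |&block|
    calls << block
    canned_result
  end
  calls
end

l = lambda { |a, b| a <=> b }
array = [1, 2, 3, 4, 5]
sorted_array = [2, 3, 8, 9, 1]

calls = stub_sort(array, sorted_array)

# Passing the lambda with & hands it to sort as the block itself.
temp = array.sort(&l)

raise "sort was not called exactly once" unless calls.length == 1
raise "block does not compare like l" unless calls.first.call(1, 2) == l.call(1, 2)
raise "canned result was not returned" unless temp == sorted_array
```

Only this one array object is affected; other arrays keep the normal sort.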
