JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0) ---While Tuning gpt2.finetune - utf-8

Hope you all are doing good ,
I am working on fine tuning GPT 2 model to generate Title based on the content ,While working on it ,I have created a simple CSV files containing only the title to train the model , But while inputting this model to GPT 2 for fine tuning I am getting the following ERROR ,
JSONDecodeError Traceback (most recent call last)
in ()
10 steps=1000,
11 save_every=200,
---> 12 sample_every=25) # steps is max number of training steps
13
14 # gpt2.generate(sess)
3 frames
/usr/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
336 if s.startswith('\ufeff'):
337 s = s.encode('utf8')[3:].decode('utf8')
--> 338 # raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",
339 # s, 0)
340 else:
JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)
Below is my code for the above :
import gpt_2_simple as gpt2
model_name = "120M" # "355M" for larger model (it's 1.4 GB)
gpt2.download_gpt2(model_name=model_name) # model is saved into current directory under /models/117M/
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
'titles.csv',
model_name=model_name,
steps=1000,
save_every=200,
sample_every=25) # steps is max number of training steps
I have tried all the basic mechanism of handing UTF -8 BOM but did not find any luck ,Hence requesting your help .It would be a great help from you all .

Try changing the model name because i see you input 120M and the gpt2 model is called 124M

Related

Could ruamel.yaml support type descriptor like "num: !!float 4"?

I am learning using ruamel.yaml, and I am wondering whether it supports type descriptor as the original YAML like "num: !!float 4"?
The file is like:
num: !!float 4
I tried import a file like this, but met an error:
---------------------------------------------------------------------------<br>
ValueError Traceback (most recent call last)
Input In [22], in <cell line: 2>()
1 from ruamel import yaml
2 with open("net.yaml", "r", encoding="utf-8") as yaml_file:
----> 3 yaml_dict = yaml.round_trip_load(yaml_file)
4 yaml_dict
...
File ~/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/ruamel/yaml/constructor.py:1469, in RoundTripConstructor.construct_mapping(self, node, maptyp, deep)
1462 if not isinstance(key, Hashable):
1463 raise ConstructorError(
1464 'while constructing a mapping',
1465 node.start_mark,
1466 'found unhashable key',
1467 key_node.start_mark,
1468 )
-> 1469 value = self.construct_object(value_node, deep=deep)
1470 if self.check_mapping_key(node, key_node, maptyp, key, value):
1471 if key_node.comment and len(key_node.comment) > 4 and key_node.comment[4]:
File ~/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/ruamel/yaml/constructor.py:146, in BaseConstructor.construct_object(self, node, deep)
142 # raise ConstructorError(
143 # None, None, 'found unconstructable recursive node', node.start_mark
144 # )
145 self.recursive_objects[node] = None
--> 146 data = self.construct_non_recursive_object(node)
148 self.constructed_objects[node] = data
149 del self.recursive_objects[node]
File ~/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/ruamel/yaml/constructor.py:181, in BaseConstructor.construct_non_recursive_object(self, node, tag)
179 constructor = self.__class__.construct_mapping
180 if tag_suffix is None:
--> 181 data = constructor(self, node)
182 else:
183 data = constructor(self, tag_suffix, node)
File ~/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/ruamel/yaml/constructor.py:1271, in RoundTripConstructor.construct_yaml_float(self, node)
1259 return ScalarFloat(
1260 sign * float(value_s),
1261 width=width,
(...)
1268 anchor=node.anchor,
1269 )
1270 width = len(value_so)
-> 1271 prec = value_so.index('.') # you can use index, this would not be float without dot
1272 lead0 = leading_zeros(value_so)
1273 return ScalarFloat(
1274 sign * float(value_s),
1275 width=width,
(...)
1279 anchor=node.anchor,
1280 )
ValueError: substring not found
Why do I get this error, and how do I get rid of it?
That is a bug in ruamel.yaml<=0.17.21. The comment on the offending line (1271) says
# you can use index, this would not be float without dot
Obviously the author of that comment didn't know what he was talking about, as in your case, when using !!float 4 you have a float without a dot...
It is trivial to "fix" that by replacing index with find in line 1271, and when doing so that will load your document and you can dump the data.
But the corresponding representer for dumping doesn't cope with that outputs the float as 4.0, dropping the tag.
You could temporarily fix this by registering a simpler float constructor (e.g. the simple one from the SafeLoader), although this will affect all floats:
import sys
import ruamel.yaml
yaml_str = """\
num: !!float 4
"""
yaml = ruamel.yaml.YAML()
yaml.constructor.add_constructor(
'tag:yaml.org,2002:float', ruamel.yaml.constructor.SafeConstructor.construct_yaml_float
)
data = yaml.load(yaml_str)
yaml.dump(data, sys.stdout)
which gives:
num: 4.0

Read image file pixel value and write it to a data frame, saved it to csv, when I reload, Image can't show

First I open images and get their pixel value, then write to a dataframe.
skin_df_balanced['pix'] = skin_df_balanced['path'].map(lambda x: np.asarray(Image.open(x).resize((SIZE,SIZE))))
Then I saved this dataframe to csv:
skin_df_balanced.to_csv('./skin_df_balanced_pixel_value_500.csv')
But when I reload the csv and tried to plot the images:
test = pd.read_csv('./skin_df_balanced_pixel_value_500.csv',index_col=0)
n_samples = 5
fig, m_axs = plt.subplots(7, n_samples, figsize = (4*n_samples, 3*7))
for n_axs, (type_name, type_rows) in zip(m_axs, test.sort_values(['type']).groupby('type')):
n_axs[0].set_title(type_name)
for c_ax, (_, c_row) in zip(n_axs, type_rows.sample(n_samples, random_state=1234).iterrows()):
c_ax.imshow(c_row['pix'])
c_ax.axis('off')
I found this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-abe099827f62> in <module>()
5 n_axs[0].set_title(type_name)
6 for c_ax, (_, c_row) in zip(n_axs, type_rows.sample(n_samples, random_state=1234).iterrows()):
----> 7 c_ax.imshow(c_row['pix'])
8 c_ax.axis('off')
4 frames
/usr/local/lib/python3.7/dist-packages/matplotlib/image.py in set_data(self, A)
692 not np.can_cast(self._A.dtype, float, "same_kind")):
693 raise TypeError("Image data of dtype {} cannot be converted to "
--> 694 "float".format(self._A.dtype))
695
696 if not (self._A.ndim == 2
TypeError: Image data of dtype <U629 cannot be converted to float

Encapsulated poscript max line limit

I think that I'm exceeding EPS max line limits:
I'm generating an eps programmatically that consists of a grid of pictures.
My EPS has this structure:
%!PS-Adobe-3.0 EPSF-3.0
.
.
%%BeginProlog
%%EndProlog
%%Page: 1 1
%%Begin Raster Image. Index: 0
save
449 2576 translate
0 rotate
-282 -304 translate
[1 0 0 1 0 0] concat
0 0 translate
[1 0 0 1 0 0] concat
0 0 translate
userdict begin
DisplayImage
0 0
564 608
12
564 608
0
0
FBDBB9FBDCBCFDDBBAFFD8B2FFD7A9FED4A1FCD29CFDD09EFED0A2FFD0A6FFCDA3FFCBA0FFCBA0...
EED79CEBD09CEDD19EEED2A1EFD3A3F0D4A5F0D4A6F0D4A7F1D4A4F3D4A0F3D49F
end
restore
%%End Raster Image
%%Begin Raster Image. Index: 1
.
.
end
restore
%%End Raster Image
%%Begin Raster Image. Index: 2
etc
So the thing is if I write up to 4 images to the EPS, everything works fine, but when I try to write a 5th, the eps won't open on any EPS viewer including Adobe Illustrator (the operation cannot complete because of an unknown error).
I tried using different images to make sure that the particular images were ok and I got the same result, as long as I'm writing 4 images (105825 lines file) everything works. But when I use 5 (132253 lines file) it fails.
Is it possible that I'm exceeding the maximum line limit for EPS?
This are the files in question if you'd like to analyze them:
the one that works - >https://files.fm/u/bfn2d32m and the one that doesn't ->
https://files.fm/u/4gbybr3y
There's no 'line limit' in PostScript or EPS, so you can't be hitting that.
When I run your file through Ghostscript it throws an error /undefined in yImage (I'd suggest you debug PostScript using a proper PostScript interpreter, not Adobe Illustrator).
This suggests to me that one of your images is using more data than you have supplied, so the interpreter runs off the end of the data, consuming parts of the program, until it has read sufficient bytes from currentfile to satisfy the request. At that point is starts processing the file as PostScript again, but the file pointer now points to the 'yImage' of the next 'DisplayImage'. Since you haven't defined a 'yImage' key, naturally this gives you an 'undefined' error.
From your description, this would seem likely to be the 4th image, since adding the 5th throws the error. Note that if your program terminates without supplying enough data (so the interpreter reaches EOF) then the data supplied will be drawn. So it might look like your 4th image is correct even when it isn't, provided it isn't followed by any further program code.
A style note; PostScript is a stack-based language, so normally one would proceed by pushing values onto the stack and reading them from there, instead of executing the 'token' operator.
So your input would be more like :
0 0
564 608
12
564 608
0
0
DisplayImage
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
And the DisplayImage code would be :
/DisplayImage
{
%
% Display a DirectClass or PseudoClass image.
%
% Parameters:
% x & y translation.
% x & y scale.
% label pointsize.
% image label.
% image columns & rows.
% class: 0-DirectClass or 1-PseudoClass.
% compression: 0-none or 1-RunlengthEncoded.
% hex color packets.
%
gsave
/buffer 512 string def
/byte 1 string def
/color_packet 3 string def
/pixels 768 string def
/compression exch def
/class exch def
/rows exch def
/columns exch def
/pointsize exch def
scale
translate
This avoids you having to use token at all for the scale and translate operations for instance.

Wierd output characters (Chinese characters) when using Ruby to read / write CSV

I'm trying to print the first 5 lines from a set of large (>500MB) csv files into small headers in order to inspect the content more easily.
I'm using Ruby code to do this but am getting each line padded out with extra Chinese characters, like this:
week_num type ID location total_qty A_qty B_qty count਍㌀㐀ऀ猀漀爀琀愀戀氀攀ऀ㄀㤀㜀ऀ䐀䔀开伀渀氀礀ऀ㔀㐀㜀㈀ ㌀ऀ㔀㐀㜀㈀ ㌀ऀ ऀ㤀㄀㈀㔀㌀ഀ
44 small 14 A 907859 907859 0 550360਍㐀㄀ऀ猀漀爀琀愀戀氀攀ऀ㐀㈀㄀ऀ䐀䔀开伀渀氀礀ऀ㌀ ㈀㄀㜀㐀ऀ㌀ ㈀㄀
The first few lines of input file are like so:
week_num type ID location total_qty A_qty B_qty count
34 small 197 A 547203 547203 0 91253
44 small 14 A 907859 907859 0 550360
41 small 421 A 302174 302174 0 18198
The strange characters appear to be Line 1 and Line 3 of the data.
Here's my Ruby code:
num_lines=ARGV[0]
fh = File.open(file_in,"r")
fw = File.open(file_out,"w")
until (line=fh.gets).nil? or num_lines==0
fw.puts line if outflag
num_lines = num_lines-1
end
Any idea what's going on and what I can do to simply stop at the line end character?
Looking at input/output files in hex (useful suggestion by #user1934428)
Input file - each character looks to be two bytes.
Output file - notice the NULL (00) between each single byte character...
Ruby version 1.9.1
The problem is an encoding mismatch which is happening because the encoding is not explicitly specified in the read and write parts of the code. Read the input csv as a binary file "rb" with utf-16le encoding. Write the output in the same format.
num_lines=ARGV[0]
# ****** Specifying the right encodings <<<< this is the key
fh = File.open(file_in,"rb:utf-16le")
fw = File.open(file_out,"wb:utf-16le")
until (line=fh.gets).nil? or num_lines==0
fw.puts line
num_lines = num_lines-1
end
Useful references:
Working with encodings in Ruby 1.9
CSV encodings
Determining the encoding of a CSV file

Error messages when compiling old fortran code

I am trying to compile a code that hasn't been developed with the newest standards in mind. I am using the gfortran compiler and get a few errors that I can't correct.
A partner of mine uses the Intel Visual Fortran compiler in MS Visual Studio and doesn't get any errors. Below are the relevant sections of code and their errors. Insight into any of these would be greatly appreciated.
Error 1:
flow.for:63.63:
3 *gjmh_old(j)*abs(gjmh_old(j))/2.0d0/dens_fjmh(j) )
1
Error: Expected a right parenthesis in expression at (1)
Corresponding code:
sum_dpjmh_old= 1.0/2.0*(dens_jmh(j)+dens_old(j))*grav*(dzc(j)/2.0d0) !!!!
1 + ( gavg_old(j)*gavg_old(j)/dens_mom(j)
1 -gjmh_old(j)*gjmh_old(j)/dmom_jmh(j) )
3 + 1.0/2.0*
3 ( fric(j)*dzc(j)/2.0/dhyd(j)
3 *gavg_old(j)*abs(gavg_old(j))/2.0d0/dens_fric(j)
3 +fjmh(j)*dzc(j)/2.0/dhyd(j)
3 *gjmh_old(j)*abs(gjmh_old(j))/2.0d0/dens_fjmh(j) )
Error 2:
flow.for:78.13:
1 + ( gavg(j)*gavg(j)/dens_momn(j)
1
Error: Missing exponent in real number at (1)
Corresponding Code:
sum_dpjmh = 1.0/2.0*(dens_jmhn(j)+dens_ch(j))*grav*(dzc(j)/2.0d0) !!!!
1 + ( gavg(j)*gavg(j)/dens_momn(j)
1 -gjmh(j)*gjmh(j)/dmom_jmhn(j) )
3 + 1.0/2.0*
3 ( fric(j)*(dzc(j)/2.0)/dhyd(j)
3 *gavg(j)*abs(gavg(j))/2.0d0/dens_ch(j)
3 + fjmh(j)*(dzc(j)/2.0)/dhyd(j)
3 *gjmh(j)*abs(gjmh(j))/2.0d0/dens_jmhn(j) )
Error 3:
flow.for:684.72:
3 /dens_fjmh(j) )
1
Error: Syntax error in argument list at (1)
Corresponding code:
gjmh(j) = gjmh_old(j) + dt/dzc(j)*(
1 phi*2.0d0*(pjmh(j)-p(j))
1 + (1.0-phi)*2.0d0*(pjmh_old(j)-p_old(j))
1 - 2.0d0*( gavg_old(j)*gavg_old(j)/dens_mom(j)
1 -gjmh_old(j)*gjmh_old(j)/dmom_jmh(j) )
4 - dens_jmh(j)*grav*dzc(j)
3 - fjmh(j)*dzc(j)/2.0d0/dhyd(j)*gjmh_old(j)*abs(gjmh_old(j))
3 /dens_fjmh(j) )
It seems your code is not well formatted. If you are trying to use fixed format, you should follow the request of fixed format, as http://nf.nci.org.au/training/FortranBasic/slides/slides.005.html.
Or following instructions in http://en.wikipedia.org/wiki/Fortran#Fixed_layout_and_punched_cards:
Column 1 contains *, ! or C for comments.
Column 6 for continuation.
Another way is to convert fixed format to free format, but since your code is not well formatted, most converter may be not working well.

Resources