I am reading images from the local file system, converting them to bytes, and wrapping them in tf.train.Feature to write a TFRecord file. Everything works until I read the TFRecord back and extract the image bytes, which come back in a sparse format. Below is my code for the complete process flow.
Reading df and the image files: no error
import io
import tensorflow as tf
from PIL import Image

img_bytes_list = []
for img_path in df.filepath:
    with tf.io.gfile.GFile(img_path, "rb") as f:
        raw_img = f.read()
        img_bytes_list.append(raw_img)
Defining features: no error
write_features = {'filename': tf.train.Feature(bytes_list=tf.train.BytesList(value=df['filename'].apply(lambda x: x.encode("utf-8")))),
                  'img_arr': tf.train.Feature(bytes_list=tf.train.BytesList(value=img_bytes_list)),
                  'width': tf.train.Feature(int64_list=tf.train.Int64List(value=df['width'])),
                  'height': tf.train.Feature(int64_list=tf.train.Int64List(value=df['height'])),
                  'img_class': tf.train.Feature(bytes_list=tf.train.BytesList(value=df['class'].apply(lambda x: x.encode("utf-8")))),
                  'xmin': tf.train.Feature(int64_list=tf.train.Int64List(value=df['xmin'])),
                  'ymin': tf.train.Feature(int64_list=tf.train.Int64List(value=df['ymin'])),
                  'xmax': tf.train.Feature(int64_list=tf.train.Int64List(value=df['xmax'])),
                  'ymax': tf.train.Feature(int64_list=tf.train.Int64List(value=df['ymax']))}
Creating the example: no error
example = tf.train.Example(features=tf.train.Features(feature=write_features))
Writing data in TFRecord format: no error
with tf.io.TFRecordWriter('image_data_tfr') as writer:
    writer.write(example.SerializeToString())
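For reference, a hedged sketch (not the author's code) of the more common layout that writes one tf.train.Example per image instead of packing every DataFrame row into a single Example; the helper names and the use of df.itertuples() are assumptions:

import tensorflow as tf

def _bytes_feature(value):
    # Wrap a single bytes value in a BytesList feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    # Wrap a single integer value in an Int64List feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

with tf.io.TFRecordWriter('image_data_tfr') as writer:
    for row, raw_img in zip(df.itertuples(), img_bytes_list):
        feature = {'filename': _bytes_feature(row.filename.encode("utf-8")),
                   'img_arr': _bytes_feature(raw_img),
                   'width': _int64_feature(int(row.width)),
                   'height': _int64_feature(int(row.height))}
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())

With this layout every record is self-contained and each feature holds exactly one value, which matters for the parsing options sketched below.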
Reading and printing data: no error
read_features = {"filename": tf.io.VarLenFeature(dtype=tf.string),
"img_arr": tf.io.VarLenFeature(dtype=tf.string),
"width": tf.io.VarLenFeature(dtype=tf.int64),
"height": tf.io.VarLenFeature(dtype=tf.int64),
"class": tf.io.VarLenFeature(dtype=tf.string),
"xmin": tf.io.VarLenFeature(dtype=tf.int64),
"ymin": tf.io.VarLenFeature(dtype=tf.int64),
"xmax": tf.io.VarLenFeature(dtype=tf.int64),
"ymax": tf.io.VarLenFeature(dtype=tf.int64)}
Reading a single example from the TFRecord file: no error
for serialized_example in tf.data.TFRecordDataset(["image_data_tfr"]):
    parsed_s_example = tf.io.parse_single_example(serialized=serialized_example,
                                                  features=read_features)
Reading image data from the TFRecord: no error
image_raw = parsed_s_example['img_arr']
encoded_jpg_io = io.BytesIO(image_raw)
Here it gives the error: TypeError: a bytes-like object is required, not 'SparseTensor'
image = Image.open(encoded_jpg_io)
width, height = image.size
print(width, height)
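For context on the error above: tf.io.VarLenFeature always parses to a tf.sparse.SparseTensor, so io.BytesIO receives a SparseTensor rather than bytes. A minimal sketch, assuming TF 2.x eager execution and the record written above, that densifies the parsed value and recovers the first image's bytes:

import io
import tensorflow as tf
from PIL import Image

for serialized_example in tf.data.TFRecordDataset(["image_data_tfr"]):
    parsed_s_example = tf.io.parse_single_example(serialized=serialized_example,
                                                  features=read_features)
    # VarLenFeature yields a SparseTensor; densify it and take the first
    # element as a Python bytes object before handing it to PIL.
    dense_imgs = tf.sparse.to_dense(parsed_s_example['img_arr'], default_value='')
    img_bytes = dense_imgs.numpy()[0]
    image = Image.open(io.BytesIO(img_bytes))
    print(image.size)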
Please tell me what changes are required at the input of "img_arr" so that it will not generate a sparse tensor and will return bytes instead?
Is there anything that I can do to optimize my existing code?
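On the first question: one change that avoids the SparseTensor entirely, sketched here under the assumption that each Example stores exactly one value per key (as in the one-Example-per-image layout sketched earlier), is to declare the read features with tf.io.FixedLenFeature, which parse_single_example returns as dense tensors:

read_features = {"filename": tf.io.FixedLenFeature([], dtype=tf.string),
                 "img_arr": tf.io.FixedLenFeature([], dtype=tf.string),
                 "width": tf.io.FixedLenFeature([], dtype=tf.int64),
                 "height": tf.io.FixedLenFeature([], dtype=tf.int64)}
                 # (the remaining keys follow the same pattern)

for serialized_example in tf.data.TFRecordDataset(["image_data_tfr"]):
    parsed = tf.io.parse_single_example(serialized=serialized_example,
                                        features=read_features)
    # A dense scalar string tensor: .numpy() gives the raw image bytes.
    image = Image.open(io.BytesIO(parsed["img_arr"].numpy()))
    print(image.size)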
I have a snappy.parquet file with a schema like this:
{
    "type": "struct",
    "fields": [{
        "name": "MyTinyInt",
        "type": "byte",
        "nullable": true,
        "metadata": {}
    }
    ...
    ]
}
Update: parquet-tools reveals this:
############ Column(MyTinyInt) ############
name: MyTinyInt
path: MyTinyInt
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=8, isSigned=true)
converted_type (legacy): INT_8
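For what it's worth, the same schema details can be inspected from Python with pyarrow (a sketch; the file name is an assumption):

import pyarrow.parquet as pq

pf = pq.ParquetFile("my_data.snappy.parquet")  # hypothetical file name
# Parquet-level schema: shows the INT32 physical type with the
# Int(bitWidth=8) / INT_8 logical annotation, matching parquet-tools.
print(pf.schema)
# Arrow-level view of the same column, where it surfaces as int8.
print(pf.schema_arrow)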
When I try to run a stored procedure in Azure Data Studio to load this into an external staging table with PolyBase, I get the error:
11:16:21 Started executing query at Line 113
Msg 106000, Level 16, State 1, Line 1
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class java.lang.Integer cannot be cast to class parquet.io.api.Binary (java.lang.Integer is in module java.base of loader 'bootstrap'; parquet.io.api.Binary is in unnamed module of loader 'app')
The load into the external table works fine with only varchars:
CREATE EXTERNAL TABLE [domain].[TempTable]
(
...
MyTinyInt tinyint NULL,
...
)
WITH
(
LOCATION = ''' + #Location + ''',
DATA_SOURCE = datalake,
FILE_FORMAT = parquet_snappy
)
The data will eventually be merged into a Data Warehouse Synapse table. In that table the column will have to be of type tinyint.
I had the same issue and a good support plan in Azure, so I got an answer from Microsoft:
There is a known bug in ADF for this particular scenario: the date type in Parquet should be mapped as data type date in SQL Server; however, ADF incorrectly converts this type to datetime2, which causes a conflict in PolyBase. I have confirmation from the core engineering team that this will be rectified with a fix by the end of November and will be published directly into the ADF product.
In the meantime, as a workaround:
Create the target table with data type DATE as opposed to DATETIME2
Configure the Copy Activity Sink settings to use Copy Command as opposed to PolyBase
But even the Copy command doesn't work for me, so the only remaining workaround is to use bulk insert, and bulk insert is extremely slow, which would be a problem on big datasets.
I am trying to parse log data into Parquet file format in Hive; the separator used is "||-||".
The sample row is
"b8905bfc-dc34-463e-a6ac-879e50c2e630||-||syntrans1||-||CitBook"
After performing the data staging I am able to get the result
"b8905bfc-dc34-463e-a6ac-879e50c2e630 syntrans1 CitBook ".
While converting the data to Parquet file format I got this error:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:137)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:297)
... 24 more
This is what I have tried:
create table log (a String, b String, c String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
"field.delim"="||-||",
"collection.delim"="-",
"mapkey.delim"="#"
);
create table log_par(
a String ,
b String ,
c String
) stored as PARQUET ;
insert into log_par select * from log;
Aman kumar,
To resolve this issue, run the hive query after adding the following jar:
hive> add jar hive-contrib.jar;
To add the jar permanently, do the following:
1. On Hive Server host, create a /usr/hdp//hive/auxlib directory.
2. Copy /usr/hdp//hive/lib/hive-contrib-.jar to /usr/hdp//hive/auxlib.
3. Restart the HS2 server.
Please check these further references:
https://community.hortonworks.com/content/supportkb/150175/errororgapachehadoophivecontribserde2multidelimits.html.
https://community.hortonworks.com/questions/79075/loading-data-to-hive-via-pig-orgapachehadoophiveco.html
Let me know if you face any issues.
I am trying to insert this http://pastebin.com/cKXnCqx7 geometry into an SDO_GEOMETRY field, as follows:
declare
str clob := '<geometry string from http://pastebin.com/cKXnCqx7>';
begin
INSERT INTO TEMP_TEST_GEOMETRY VALUES (SDO_CS.TRANSFORM (SDO_GEOMETRY (str, 4674), 1000205));
end;
And it generates the following error:
The maximum size of a string literal is 32K in PL/SQL. You're trying to assign a 98K+ literal, which is what is causing the error (although since that refers to 4k, that seems to be from a plain SQL call rather than the PL/SQL block you showed). It isn't getting as far as trying to create the sdo_geometry object from the CLOB.
Ideally you'd load the value from a file or some other mechanism that presents you with the complete CLOB. If you have to treat it as a literal you'd have to manually split it up into chunks; at least 4 with this value if you make them large:
declare
str clob;
begin
dbms_lob.createtemporary(str, false);
dbms_lob.append(str, 'POLYGON ((-47.674240062208945 -2.8066454423517624, -47.674313162 -2.8066509996, <...snip...>');
dbms_lob.append(str, '47.6753521374 -2.8067875453, -47.6752566506 -2.8067787567, -47.6752552377 -2.8067785117, <...snip...>');
dbms_lob.append(str, '-47.658134547 -2.8044846153, -47.6581360233 -2.8044849964, -47.6581811229 -2.8044926289, <...snip...>');
dbms_lob.append(str, '-47.6717079633 -2.8057792966, -47.6717079859 -2.80577931, -47.6717083252 -2.8057795101, -47.6718125136 -2.8058408619, -47.6719547186 -2.8059291721, -47.6719573483 -2.8059307844, -47.6719575243 -2.8059308908, -47.6719601722 -2.8059324729, -47.6720975574 -2.8060134925, -47.6721015308 -2.8060157903, -47.6721017969 -2.8060159412, -47.6721058088 -2.806018171, -47.6721847946 -2.8060611947, -47.6721897923 -2.8060650985, -47.6722059263 -2.8060767675, -47.6722070291 -2.8060775047, -47.6722416572 -2.8060971165, -47.6722428566 -2.8060976832, -47.6722611616 -2.8061055189, -47.6722666301 -2.8061076243, -47.6722849174 -2.806116847, -47.6722862515 -2.8061174528, -47.6723066339 -2.8061257231, -47.6723316499 -2.8061347029, -47.6723426416 -2.8061383836, -47.6723433793 -2.8061386131, -47.672354519 -2.8061418177, -47.6723803034 -2.8061486384, -47.6725084039 -2.8061908942, -47.6725130545 -2.8061923817, -47.6725133654 -2.806192478, -47.6725180423 -2.806193881, -47.6728423039 -2.8062879629, -47.6728698649 -2.8062995965, -47.6728952856 -2.8063088527, -47.672897007 -2.8063093833, -47.672949984 -2.8063200428, -47.6729517767 -2.8063202193, -47.672966443 -2.8063209226, -47.6729679855 -2.8063213223, -47.6733393514 -2.8064196858, -47.6733738728 -2.8064264543, -47.6733761939 -2.8064267537, -47.6733804796 -2.8064270239, -47.6733890639 -2.806431641, -47.6734057692 -2.8064398944, -47.6734069006 -2.8064404055, -47.6734241363 -2.8064474849, -47.6736663052 -2.8065373005, -47.6736676833 -2.8065378073, -47.6736677752 -2.8065378409, -47.673669156 -2.8065383402, -47.6737754465 -2.8065764489, -47.673793217 -2.8065850801, -47.6737945765 -2.8065856723, -47.6738366895 -2.8066000123, -47.6738381279 -2.8066003728, -47.6738811819 -2.8066074503, -47.6739137725 -2.8066153813, -47.6739159177 -2.806615636, -47.6739318345 -2.8066165588, -47.673951326 -2.806622013, -47.6739530661 -2.8066223195, -47.6739793945 -2.8066256302, -47.6740948445 -2.8066344025, -47.674240062208945 -2.8066454423517624))');
INSERT INTO TEMP_TEST_GEOMETRY
VALUES (SDO_CS.TRANSFORM (SDO_GEOMETRY (str, 4674), 1000205));
dbms_lob.freetemporary(str);
end;
/
You can make the chunks smaller and have many more appends, which may be more manageable in some respects, but is more work to set up. It's still painful to do manually though, so this is probably a last resort if you can't get the value some other way.
If the string is coming from an HTTP call you can read the response in chunks and convert it into a CLOB as you read it in much the same way, using dbms_lob.append. Exactly how depends on the mechanism you're currently using to get the response. It's also worth noting that Oracle 12c has built-in JSON handling; in earlier versions you may be able to use the third-party PL/JSON module, which seems to handle CLOBs.
I want to insert an image into a Word file. If I try the code below, the Word file shows some unknown symbols like
"ΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰΰ"
My code:
figure,imshow(img1);
fid = fopen('mainfile.doc', 'a', 'n', 'UTF-8');
fwrite(fid, img1, 'char');
fclose(fid);
open('mainfile.doc');
fwrite won't do this directly. You could try the MATLAB Report Generator if you have access to it, or a File Exchange submission, OfficeDoc.