saveAsNewAPIHadoopFile changing the character encoding to UTF-8 - hadoop

I am trying to save an RDD encoded in the ISO-8859-1 charset to an AWS S3 bucket using saveAsNewAPIHadoopFile, but the character encoding changes to UTF-8 when the file is written to the bucket.
Code snippet:
import java.nio.charset.Charset
val cell = " MYCOST £25" // This is in UTF-8 character encoding.
val charset: Charset = Charset.forName("ISO-8859-1")
val cellData = cell.padTo(50, ' ').mkString
val isoData = new String(cellData.getBytes(charset), charset) // meant to convert the string from UTF-8 to ISO-8859-1
But when I save the file using saveAsNewAPIHadoopFile, the content ends up in UTF-8.
I think the TextOutputFormat used by saveAsNewAPIHadoopFile automatically converts the file encoding to UTF-8. Is there a way to save the content to the S3 bucket in the same encoding (ISO-8859-1)?
ds.rdd.map { record =>
  // Assumption: the opening of this inner block was lost when the post was formatted;
  // each record is taken to be a sequence of string cells.
  record.map { cell =>
    val cellData = cell.padTo(50, ' ').mkString
    new String(cellData.getBytes("ISO-8859-1"), "ISO-8859-1")
  }.reduce { _ + _ }
}.mapPartitions { iter =>
  val text = new Text()
  iter.map { item =>
    text.set(item)
    (NullWritable.get(), text)
  }
}.saveAsNewAPIHadoopFile("s3://mybucket/", classOf[NullWritable], classOf[BytesWritable], classOf[TextOutputFormat[NullWritable, BytesWritable]])
Appreciate your help

I still haven't found a proper answer, but as a workaround I copy the file to HDFS, convert it to ISO-8859-1 with iconv, and save it back to the S3 bucket. This does the job, but it requires two extra steps on the EMR cluster.
I thought this might be useful to anyone who comes across the same problem.
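A small plain-Java sketch (no Spark or Hadoop on the classpath) of why the conversion in the question has no visible effect. A JVM String does not carry a charset: encoding it to ISO-8859-1 bytes and decoding those bytes with the same charset gives back an equal String, and the charset only matters at the moment the String is turned into bytes. The class and method names below are made up for illustration.

```java
import java.nio.charset.Charset;

// Illustrative class name; not part of the original question.
final class Latin1RoundTrip {
    static final Charset LATIN1 = Charset.forName("ISO-8859-1");
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Encode to ISO-8859-1 bytes and decode them straight back: the same
    // pattern as new String(cellData.getBytes(charset), charset) in the question.
    static String roundTrip(String s) {
        return new String(s.getBytes(LATIN1), LATIN1);
    }

    // The charset only becomes visible once the String is turned into bytes.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        String cell = " MYCOST \u00A325"; // \u00A3 is the pound sign
        System.out.println(roundTrip(cell).equals(cell)); // the round trip leaves the String unchanged
        System.out.println(encode(cell, LATIN1).length);  // pound sign takes one byte here
        System.out.println(encode(cell, UTF8).length);    // pound sign takes two bytes here
    }
}
```

Hadoop's Text class stores its contents as UTF-8, so text.set(item) re-encodes whatever String it is given. One direction that may be worth testing, and is untested here, is to hand Text the already-encoded bytes with text.set(item.getBytes("ISO-8859-1")) instead of the String.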

Related

quote_from_bytes() expected bytes error when data is in bytes format when uploading to blob storage in Python3

Can someone tell me what I am doing wrong with trying to upload an image to blob storage? Below is my code.
print(type(img['image']))  # Output is <class 'bytes'>
connection_string = get_blob_connection_string()
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client(container="images", blob=img['id'])
exists = blob_client.exists()
if not exists:
    result = blob_client.upload_blob(img['image'], blob_type="blockblob")
    print(result)
When inserting the blob, it throws the error:
quote_from_bytes() expected bytes
This error makes no sense to me: I gave it bytes. What am I missing?
I reproduced this on my end and got the same error. You are receiving it because of an incompatible file type (i.e., file format).
After changing the line below to the correct format, I was able to achieve what you need.
blob_client = blob_service_client.get_blob_client(container="images", blob=img['id'])
Below is the correct format
blob_client = blob_service_client.get_blob_client(container='container', blob='<LOCAL FILE PATH>')
Below is the complete code that worked for me
from azure.storage.blob import BlobServiceClient
from PIL import Image
connection_string = "<CONNECTION STRING>"
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client(container='container', blob='<LOCALPATH>')
with open(file='<PATH IN YOUR STORAGE ACCOUNT WITH FILE NAME>', mode="rb") as data:
    blob_client.upload_blob(data)

Apache HTTP Client forcing UTF-8 encoding

I'm making a REST call using the org.apache.http package as shown below. I expect the response to contain user profile details in English and other international languages.
HttpGet req = new HttpGet(baseUrl + uri);
HttpResponse res = closeableHttpClient.execute(req);
The response has UTF-8 as its character set, which is what I wanted. From here, I tried two approaches to unmarshal the response into a map.
Approach-1:
String response = EntityUtils.toString(res.getEntity(),"UTF-8");
// String response = EntityUtils.toString(httpResponse.getEntity(),Charset.forName("UTF-8"));
map = jsonConversionUtil.convertStringtoMap(response);
Issue:
httpResponse.getEntity() was returning a StringEntity object whose default charset was ISO_8859_1. Even when I tried to force the conversion to UTF-8 (I tried both the uncommented line and the commented line above), I was not able to override it to UTF-8.
Approach-2:
HttpEntity responseEntity = res.getEntity();
if (responseEntity != null) {
    InputStream contentStream = responseEntity.getContent();
    if (contentStream != null) {
        String response = IOUtils.toString(contentStream, "UTF-8");
        map = jsonConversionUtil.convertStringtoMap(response);
    }
}
Issue:
IOUtils.toString(contentStream, "UTF-8"); does not give me UTF-8 either.
I am using the httpclient 4.3.2 jar and the httpcore-4.3.1 jar. The Java version is Java 6, and I can't upgrade to a higher version.
Can you please advise how I can get the response as UTF-8?
If the StringEntity object has an ISO-8859-1 encoding, then the server has returned its response encoded as ISO-8859-1. Your assumption that "the response has UTF-8 as character set" is most likely wrong.
Since it's ISO-8859-1, neither of your approaches works:
Approach 1: The "UTF-8" parameter has no effect, because it only specifies the default encoding to use when the server doesn't specify one (see EntityUtils.toString()). But the server has obviously specified one.
Approach 2: Reading binary content as UTF-8 when it is in fact encoded in ISO-8859-1 will likely result in garbage (though the ASCII characters have the same representation in UTF-8 and ISO-8859-1).
So try to ask the server to return UTF-8:
HttpGet req = new HttpGet(baseUrl + uri);
req.addHeader("Accept", "application/json");
req.addHeader("Accept-Charset", "utf-8");
HttpResponse res = closeableHttpClient.execute(req);
If it disregards the specified character set and still returns JSON in ISO-8859-1, then it will be unable to use characters outside the ISO-8859-1 range (unless it uses escaping within JSON).
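To make the point about Approach 2 concrete, here is a minimal sketch that needs only the JDK (it compiles on Java 6, since it avoids StandardCharsets). It simulates a response body that was encoded as ISO-8859-1 and decodes it both ways. The class name, method name, and sample text are illustrative assumptions, not part of HttpClient.

```java
import java.nio.charset.Charset;

// Illustrative class name; simulates a response body without any HTTP call.
final class WrongCharsetDecode {
    static final Charset LATIN1 = Charset.forName("ISO-8859-1");
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Decode raw bytes with the given charset, as a client has to do with a response body.
    static String decode(byte[] body, Charset cs) {
        return new String(body, cs);
    }

    public static void main(String[] args) {
        // Bytes as an ISO-8859-1 server would send them; \u00E9 is e with an acute accent.
        byte[] body = "ferm\u00E9".getBytes(LATIN1);
        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(decode(body, LATIN1));
        // Decoding the same bytes as UTF-8 hits a malformed sequence, which the JDK
        // replaces with the Unicode replacement character U+FFFD.
        System.out.println(decode(body, UTF8));
    }
}
```

The bytes on the wire decide the outcome: the charset passed to the decoder has to match what the server actually used, which is why asking the server for UTF-8 is the more promising route.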

How to read a COMPRESS()-ed H2 blob column via JDBC?

I have a file-based H2 database (engine version 1.4.196) with a mediumblob column containing data returned by the COMPRESS() function:
create table foo (compressed_data mediumblob);
...
insert into foo (compressed_data) values (COMPRESS(STRINGTOUTF8('Test'), 'DEFLATE'));
(The table is created and filled by flyway.)
I'd like to read this data in a JDBC client without calling DECOMPRESS() first. (I want to do the decompression client-side for compatibility with another system). I've tried to read the data via an InflaterInputStream, which can uncompress DEFLATE data:
try (InputStream dbStream = rs.getBinaryStream("compressed_data");
     InflaterInputStream inflaterStream = new InflaterInputStream(dbStream);
) {
    inflaterStream.read();
    ...
But this causes an error:
java.util.zip.ZipException: incorrect header check
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
...
Is there any way I can get InflaterInputStream-compatible compressed data from a column in H2?
Since you are already using H2 JDBC to access the database you can simply retrieve the compressed data with getBytes and use the expand method of org.h2.tools.CompressTool to uncompress it:
// .java source file is Cp1252 encoded
String sql = "SELECT COMPRESS(STRINGTOUTF8('fermé'), 'DEFLATE') AS foo";
ResultSet rs = st.executeQuery(sql);
rs.next();
byte[] bytesOut = rs.getBytes(1);
byte[] expanded = org.h2.tools.CompressTool.getInstance().expand(bytesOut);
String strOut = new String(expanded, "UTF-8");

Encoding to UTF-8 files in hadoop

I'm writing a MapReduce program to clean some files stored in HDFS. For that, I have to encode all files in UTF-8. I tried to re-encode the Text value in my mapper, but I still have errors in my result file.
if (encoding.compareTo("UTF-8") != 0) {
    final Charset fromCharset = Charset.forName(encoding);
    final Charset toCharset = Charset.forName("UTF-8");
    String fixed = new String(value.toString().getBytes(fromCharset), toCharset);
    result = new String(fixed);
}
I also customized the LineReader to encode the bytes it reads into UTF-8 before they are stored in the Text object.
// buffer contains the data read from one line of the file
String s = new String(buffer, startPosn, appendLength);
byte ptext[] = Charset.forName("UTF-8").encode(s).array();
str.append(ptext, 0, ptext.length);
Can you help me, please?
I found the answer:
if (encoding.compareTo("CP1252") == 0)
    valueInString = new String(value.getBytes(), 0, value.getLength(), StandardCharsets.ISO_8859_1);
else
    valueInString = value.toString();
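The reason this works, sketched without Hadoop: the raw bytes have to be decoded with the charset they were written in, and only then encoded as UTF-8. Calling value.toString() first decodes the bytes as UTF-8, which damages any non-ASCII byte before the conversion even starts. In the sketch below a plain byte array and a length stand in for Text.getBytes() and Text.getLength(); the class and method names are illustrative assumptions.

```java
import java.nio.charset.Charset;

// Illustrative helper; a byte[] plus a valid length stands in for a Hadoop Text value.
final class TranscodeToUtf8 {
    static final Charset LATIN1 = Charset.forName("ISO-8859-1");
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Decode the first `length` bytes with the source charset, then encode the
    // resulting String as UTF-8. Bytes beyond `length` are ignored, mirroring a
    // backing array that is larger than the valid data.
    static byte[] toUtf8(byte[] raw, int length, Charset source) {
        String decoded = new String(raw, 0, length, source);
        return decoded.getBytes(UTF8);
    }

    public static void main(String[] args) {
        // "ferm" followed by 0xE9 (e with an acute accent in ISO-8859-1), plus two unused slots.
        byte[] raw = {0x66, 0x65, 0x72, 0x6D, (byte) 0xE9, 0, 0};
        byte[] utf8 = toUtf8(raw, 5, LATIN1);
        System.out.println(new String(utf8, UTF8));
    }
}
```

Note that CP1252 and ISO-8859-1 differ in the 0x80 to 0x9F range, so files that use characters from that range may need the actual source charset passed in instead of ISO-8859-1.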

Is there a way to remove the BOM from a UTF-8 encoded file?

Is there a way to remove the BOM from a UTF-8 encoded file?
I know that all of my JSON files are encoded in UTF-8, but the data entry person who edited the JSON files saved it as UTF-8 with the BOM.
When I run my Ruby scripts to parse the JSON, it is failing with an error.
I don't want to manually open 58+ JSON files and convert to UTF-8 without the BOM.
With Ruby >= 1.9.2 you can use the mode r:bom|utf-8.
This should work (I haven't tested it in combination with JSON):
json = nil # define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8") { |file|
  json = JSON.parse(file.read)
}
It doesn't matter whether the BOM is present in the file or not.
Andrew remarked that File#rewind can't be used with the BOM.
If you need a rewind function, you must remember the position and replace rewind with pos=:
# Prepare test file
File.open('file.txt', "w:utf-8") { |f|
  f << "\xEF\xBB\xBF" # add BOM
  f << 'some content'
}
# Read file and skip BOM if available
File.open('file.txt', "r:bom|utf-8") { |f|
  pos = f.pos
  p content = f.read # read and write file content
  f.pos = pos # f.rewind goes to pos 0
  p content = f.read # (re)read and write file content
}
So, the solution was to do a search and replace on the BOM via gsub!
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.
I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string
def read_json_file(file_name, index)
  content = ''
  file = File.open("#{file_name}\\game.json", "r")
  content = file.read.force_encoding("UTF-8")
  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')
  json = JSON.parse(content)
  print json
end
You can also specify encoding with the File.read and CSV.read methods, but you don't specify the read mode.
File.read(path, :encoding => 'bom|utf-8')
CSV.read(path, :encoding => 'bom|utf-8')
The "bom|UTF-8" encoding works well if you only read the file once, but fails if you ever call File#rewind, as I was doing in my code. To address this, I did the following:
def ignore_bom
  @file.ungetc if @file.pos==0 && @file.getc != "\xEF\xBB\xBF".force_encoding("UTF-8")
end
which seems to work well. Not sure if there are other similar type characters to look out for, but they could easily be built into this method that can be called any time you rewind or open.
Server-side cleanup of UTF-8 BOM bytes that worked for me:
csv_text.gsub!("\xEF\xBB\xBF".force_encoding(Encoding::BINARY), '')
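For reference, the byte-level check that all of these answers rely on, sketched in Java rather than Ruby so that it matches the other JVM examples on this page. A UTF-8 BOM is the three bytes EF BB BF at the very start of the data, and removing it means dropping exactly those bytes when they are present. The class and method names are illustrative assumptions.

```java
import java.util.Arrays;

// Illustrative helper; works on raw file contents that were read as bytes.
final class Utf8Bom {
    // The three bytes that make up a UTF-8 byte order mark.
    private static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True when the data begins with the UTF-8 BOM.
    static boolean startsWithBom(byte[] data) {
        return data.length >= BOM.length
                && data[0] == BOM[0] && data[1] == BOM[1] && data[2] == BOM[2];
    }

    // Return the data without a leading BOM; data without a BOM is returned as-is.
    static byte[] strip(byte[] data) {
        return startsWithBom(data)
                ? Arrays.copyOfRange(data, BOM.length, data.length)
                : data;
    }
}
```

Checking only the start of the data avoids touching an EF BB BF sequence that happens to appear later in the content, which a global search and replace would also remove.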
