Encoding files to UTF-8 in Hadoop

I'm writing a MapReduce program to clean some files stored in HDFS. To do that, I have to encode all files in UTF-8. I tried to encode the Text value in my mapper, but I still have errors in my result file.
if (encoding.compareTo("UTF-8") != 0) {
    final Charset fromCharset = Charset.forName(encoding);
    final Charset toCharset = Charset.forName("UTF-8");
    String fixed = new String(value.toString().getBytes(fromCharset), toCharset);
    result = new String(fixed);
}
I also customized the LineReader so the bytes read from the file are encoded to UTF-8 before they are stored in the Text object.
// buffer contains the data read from one line of the file
String s = new String(buffer, startPosn, appendLength);
byte ptext[] = Charset.forName("UTF-8").encode(s).array();
str.append(ptext, 0, ptext.length);
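(A side note, not from the original post: Charset.encode(...).array() returns the ByteBuffer's whole backing array, which can be longer than the encoded data and so append stray trailing bytes. A length-safe sketch, if this path is kept:)
// needs: import java.nio.ByteBuffer; import java.nio.charset.StandardCharsets;
ByteBuffer encoded = StandardCharsets.UTF_8.encode(s);
byte[] ptext = new byte[encoded.remaining()];
encoded.get(ptext); // copy exactly the encoded bytes, nothing more
str.append(ptext, 0, ptext.length);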
Can you help me, please?

I found the answer:
if(encoding.compareTo("CP1252")==0)
valueInString= new String(value.getBytes(),
0, value.getLength(),
StandardCharsets.ISO_8859_1);
else valueInString=value.toString();
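For context, a minimal mapper sketch built around this fix (the class name, key type, and the way encoding is obtained are assumptions, not part of the original answer):
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CleanMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    // Assumed to come from the job configuration; "CP1252" is just an example value.
    private String encoding = "CP1252";

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String valueInString;
        if (encoding.compareTo("CP1252") == 0) {
            // Re-decode the raw bytes held by the Text object; Text.toString()
            // would wrongly assume those bytes are UTF-8.
            valueInString = new String(value.getBytes(), 0, value.getLength(),
                    StandardCharsets.ISO_8859_1);
        } else {
            valueInString = value.toString();
        }
        // Writing the String back into a Text re-encodes it as UTF-8 in the output.
        context.write(key, new Text(valueInString));
    }
}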

Related

Apache HTTP Client forcing UTF-8 encoding

I'm making a REST call using the org.apache.http package as below. I'm expecting user profile details in the response, in English and other international languages.
HttpGet req = new HttpGet(baseUrl + uri);
HttpResponse res= closeableHttpClient.execute(req);
The response has UTF-8 as its character set, which is what I wanted. From here, I used two approaches to unmarshal the response into a map.
Approach-1:
String response = EntityUtils.toString(res.getEntity(),"UTF-8");
// String response = EntityUtils.toString(httpResponse.getEntity(),Charset.forName("UTF-8"));
map = jsonConversionUtil.convertStringtoMap(response);
Issue:
httpResponse.getEntity() was returning a StringEntity object whose default charset was ISO_8859_1, but even when I force the conversion to UTF-8 (I tried both the uncommented and the commented line above), I'm not able to override it to UTF-8.
Approach-2:
HttpEntity responseEntity = res.getEntity();
if (responseEntity != null) {
    InputStream contentStream = responseEntity.getContent();
    if (contentStream != null) {
        String response = IOUtils.toString(contentStream, "UTF-8");
        map = jsonConversionUtil.convertStringtoMap(response);
    }
}
Issue:
IOUtils.toString(contentStream, "UTF-8"); is not converting the content to UTF-8.
I am using the httpclient 4.3.2 and httpcore 4.3.1 jars. The Java version used is Java 6; I can't upgrade to a higher Java version.
Can you please guide me on how I can get the response in UTF-8?
If the StringEntity object has an ISO-8859-1 encoding, then the server has returned its response encoded as ISO-8859-1. Your assumption that "the response has UTF-8 as character set" is most likely wrong.
Since it's ISO-8859-1, neither of your approaches works:
Approach 1: The "UTF-8" parameter has no effect as the parameter specifies the default encoding in case the server doesn't specify one (see EntityUtils.toString(). But the server has obviously specified one.
Approach 2: Reading the binary content as UTF-8, which is in fact encoded in ISO-8859-1, will likely result in garbage (though many characters have a similar representation in UTF-8 and ISO-8859-1).
So try to ask the server to return UTF-8:
HttpGet req = new HttpGet(baseUrl + uri);
req.addHeader("Accept", "application/json");
req.addHeader("Accept-Charset", "utf-8");
HttpResponse res = closeableHttpClient.execute(req);
If it disregards the specified character set and still returns JSON in ISO-8859-1, then it will be unable to use characters outside the ISO-8859-1 range (unless it uses escaping within the JSON).
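To double-check what the server actually sent, here is a short sketch (not part of the original answer) that reads the charset from the response's Content-Type header using the ContentType helper from httpcore 4.3:
import java.nio.charset.Charset;

import org.apache.http.HttpEntity;
import org.apache.http.entity.ContentType;

// `res` is the HttpResponse returned by closeableHttpClient.execute(req).
HttpEntity entity = res.getEntity();
ContentType contentType = ContentType.getOrDefault(entity);
// getCharset() is null when the Content-Type header has no charset parameter;
// getOrDefault falls back to text/plain ISO-8859-1 when the header is missing entirely.
Charset charset = contentType.getCharset();
System.out.println("Server reported " + contentType.getMimeType() + " with charset " + charset);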

saveAsNewAPIHadoopFile changing the character encoding to UTF-8

I am trying to save an RDD encoded with the ISO-8859-1 charset to an AWS S3 bucket using saveAsNewAPIHadoopFile, but the character encoding changes to UTF-8 when it is saved to the S3 bucket.
Code snippet
val cell = " MYCOST £25" // this is in UTF-8 character encoding
val charset: Charset = Charset.forName("ISO-8859-1")
val cellData = cell.padTo(50, " ").mkString
val isoData = new String(cellData.getBytes(charset), charset) // here it converts the string from UTF-8 to ISO-8859-1
But when I save the file using saveAsNewAPIHadoopFile, it changes to UTF-8.
I think the TextOutputFormat used by saveAsNewAPIHadoopFile automatically converts the output to UTF-8. Is there a way I can save the content to the S3 bucket with the original encoding (ISO-8859-1)?
ds.rdd.map { record =>
  val cellData = record.padTo(50, " ").mkString
  new String(cellData.getBytes("ISO-8859-1"), "ISO-8859-1")
}.mapPartitions { iter =>
  val text = new Text()
  iter.map { item =>
    text.set(item) // Text.set(String) stores the string as UTF-8 bytes
    (NullWritable.get(), text)
  }
}.saveAsNewAPIHadoopFile("s3://mybucket/", classOf[NullWritable], classOf[BytesWritable],
  classOf[TextOutputFormat[NullWritable, BytesWritable]])
Appreciate your help
I still haven't got the correct answer, but as a workaround I am copying the file to HDFS, converting it to ISO-8859-1 with iconv, and saving it back to the S3 bucket. This does the job for me, but it requires two extra steps in the EMR cluster.
I thought it might be useful to anyone who comes across the same problem.
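As an untested in-code alternative (a sketch, not from the original answer): Text.set(String) always stores UTF-8 bytes, but TextOutputFormat writes a Text value's byte array as-is, so setting the ISO-8859-1 bytes on the Text object directly should keep the original encoding in the output:
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.Text;

// `item` stands for one already-padded record string (hypothetical name).
String item = " MYCOST £25";
byte[] isoBytes = item.getBytes(StandardCharsets.ISO_8859_1);

Text text = new Text();
// Store the raw ISO-8859-1 bytes instead of letting Text re-encode the String as UTF-8.
text.set(isoBytes, 0, isoBytes.length);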

JRecord - Formatting file transferred from Mainframe

I am trying to display a mainframe file in an Eclipse RCP application using the JRecord library. I already have the COBOL copybook as a text file.
To accomplish that:
1. I transfer the file from the mainframe to my desktop through the Apache Commons Net FTPClient API.
2. Now I have a text file.
3. I remove the newline and carriage-return characters.
4. Then I read it via a CobolIoProvider and convert it into an ArrayList of type AbstractLine.
But I have offset issues because of some special characters.
Here are the issues:
- When I don't perform step #3, there are offset issues right from record 1; hence I included step #3.
- Even when I perform step #3, the first few thousand records seem to be formatted (or read) correctly by the AbstractLineReader until it encounters a special character (not sure, but that's my assumption).
Code snippet:
ArrayList<AbstractLine> lines = new ArrayList<AbstractLine>();
InputStream copyStream;
InputStream fis;
try {
    copyStream = new FileInputStream(new File(copybookfile));
    String filec = FileUtils.readFileToString(new File(datafile));
    System.out.println("initial len: " + filec.length());
    filec = filec.replaceAll("\r", "");
    filec = filec.replaceAll("\n", "");
    System.out.println("initial len: " + filec.length());
    fis = new ByteArrayInputStream(filec.getBytes());
    CobolIoProvider ioProvider = CobolIoProvider.getInstance();
    AbstractLineReader reader = ioProvider.newIOBuilder(copyStream, "REQUEST",
            Convert.FMT_MAINFRAME).newReader(fis);
    AbstractLine line;
    while ((line = reader.read()) != null) {
        lines.add(line);
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
What am I missing here? Is there any additional preprocessing I need to do for the file transferred from the mainframe?
If it is a text file (no binary data) with \r\n line delimiters, try:
ArrayList<AbstractLine> lines = new ArrayList<AbstractLine>();
InputStream copyStream;
InputStream fis;
try {
    copyStream = new FileInputStream(new File(copybookfile));
    AbstractLineReader reader = CobolIoProvider.getInstance()
            .newIOBuilder(copyStream, "REQUEST", ICopybookDialects.FMT_MAINFRAME)
            .setFileOrganization(Constants.IO_STANDARD_TEXT_FILE)
            .newReader(datafile);
    AbstractLine line;
    while ((line = reader.read()) != null) {
        lines.add(line);
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
Note: The setFileOrganization tells JRecord what type of file it is. So .setFileOrganization(Constants.IO_STANDARD_TEXT_FILE) tells JRecord it is a Text file with \n or \r\n end-of-line markers. Here is a Description of FileOrganisation in JRecord.
The special characters worry me though; if there is a \n in the 'data', it will be treated as an end-of-line. You may need to do a binary transfer and keep the RDW (Record Descriptor Word) if it is a VB file.
If the file contains binary data, you will need to:
- do a binary transfer (keeping the RDW if it is a VB file)
- use the appropriate File-Organisation
- specify EBCDIC (.setFont("cp037") tells JRecord the file is US EBCDIC); see the sketch after this list
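For example, here is a sketch (not from the original answer) of an IOBuilder combining those options for an EBCDIC VB file; Constants.IO_VB is assumed to be the matching file organization:
AbstractLineReader reader = CobolIoProvider.getInstance()
        .newIOBuilder(copyStream, "REQUEST", ICopybookDialects.FMT_MAINFRAME)
        .setFileOrganization(Constants.IO_VB) // VB file transferred in binary with the RDW kept
        .setFont("cp037")                     // US EBCDIC
        .newReader(datafile);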
I will add a second answer for Generating Code using the RecordEditor
If you are absolutely sure all the records are the same length, you can use the low-level routines to do the reading; see the ReadAqtrans.java program in https://sourceforge.net/p/jrecord/discussion/678634/thread/4b00fed4/
Basically you would do:
ICobolIOBuilder iobuilder = CobolIoProvider.getInstance()
        .newIOBuilder("copybookFileName", ICopybookDialects.FMT_MAINFRAME)
        .setFont("CP037")
        .setFileOrganization(Constants.IO_FIXED_LENGTH);
LayoutDetail layout = iobuilder.getLayout();
FixedLengthByteReader br
        = new FixedLengthByteReader(layout.getMaximumRecordLength() + 2);
br.open("...");
byte[] bytes;
while ((bytes = br.read()) != null) {
    lines.add(iobuilder.newLine(bytes));
}
Future Reference / Binary File
If the file does contain Binary Data, you really need to do a binary transfer. You may find the RecordEditor useful.
The RecordEditor 0.98 has a JRecord code generation function. The advantages of using the RecordEditor Generate function are:
- The RecordEditor will try and work out the appropriate file attributes by looking at the file.
- You can try out various attributes (left-hand pane) and see what the file looks like with those attributes (right-hand side).
When happy, hit the Generate button and the RecordEditor will generate JRecord code. There are several code templates available:
- Standard - will generate basic JRecord code (with a field-name class)
- lineWrapper - will generate a "wrapper" class with the COBOL fields represented as get/set methods
RecordEditor Generate
In the RecordEditor select Generate >>> Java~JRecord code for Cobol.
Generate Screen
Enter the COBOL copybook / sample file and adjust the attributes as needed.
Code Template
Next you can select the code template.
Generated Code
Finally, the RecordEditor will generate JRecord code based on the attributes entered.

ParquetWriter outputs an empty Parquet file in a Java standalone program

I tried to convert an existing Avro file to Parquet, but the output Parquet file is empty. I am not sure what I did wrong...
My code snippet:
FileReader<GenericRecord> fileReader = DataFileReader.openReader(
        new File("output/users.avro"), new GenericDatumReader<GenericRecord>());
Schema avroSchema = fileReader.getSchema();
// generate the corresponding Parquet schema
MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
// choose compression scheme
CompressionCodecName compressionCodecName = CompressionCodecName.UNCOMPRESSED;
// set Parquet file block size and page size values
int pageSize = 64 * 1024;
Path outputPath = new Path("output/users.parquet");
// create a parquet writer using builder
ParquetWriter parquetWriter = (ParquetWriter) AvroParquetWriter.builder(outputPath)
        .withSchema(avroSchema)
        .withCompressionCodec(compressionCodecName)
        .withPageSize(pageSize)
        .build();
// read avro, write parquet
while (fileReader.hasNext()) {
    GenericRecord record = fileReader.next();
    System.out.println(record);
    parquetWriter.write(record);
}
I had the same problem and found that the parquetWriter needs to be closed for the data to be committed to the file. You just need to add
parquetWriter.close();
after the while loop.
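For example (a sketch of the placement; the fileReader.close() call is added for completeness):
// read avro, write parquet
while (fileReader.hasNext()) {
    GenericRecord record = fileReader.next();
    parquetWriter.write(record);
}
// Closing the writer flushes the buffered row groups and writes the Parquet footer;
// without it the output file is left empty or incomplete.
parquetWriter.close();
fileReader.close();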

Tika text extraction not working on HDFS

I'm trying to use Tika to extract text from a bunch of simple txt files stored on HDFS. I have the following code in my reducer, but surprisingly Tika does not return anything. It works fine on my local machine, but as soon as I move everything to the Hadoop cluster, the result is empty.
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX+fileAdd);
InputStream stream = fs.open(pt);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());
The last line appends the extracted content to a StringBuilder, but it is always empty.
P.S. My Hadoop cluster is Azure HDInsight, so the HDFS is backed by Blob Storage.
I also tried the following code
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
Parser parser = new TXTParser();
ParseContext con = new ParseContext();
parser.parse(stream, handler, metadata, con);
and I got the following error message:
Failed to detect the character encoding of a document
If the user does not specify a Content-Type when uploading a blob, it will be set to "application/octet-stream" by default.
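One possible workaround (a sketch, not from the original answer) is to pass Tika a type/charset hint through the Metadata object, which the detection step takes into account; the "text/plain; charset=UTF-8" value below is an assumption for plain UTF-8 text files:
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// `stream` is the input stream opened from HDFS, as in the question.
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// Hint the real type and charset so detection does not rely on the blob's
// default "application/octet-stream" Content-Type.
metadata.set(Metadata.CONTENT_TYPE, "text/plain; charset=UTF-8");
parser.parse(stream, handler, metadata);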
