I am using the Apache Commons CSV parser to convert a CSV file to a map. In the map, I cannot read some values through the IntelliJ debugger: if I manually type map.get("key"), the value is null. However, if I copy and paste the key from the map, I get the data. I can't understand what is going wrong. Any pointers would help. Thanks.
Here is my CSV parser code:
private CSVParser parseCSV(InputStream inputStream) {
    System.out.println("What is the encoding " + new InputStreamReader(inputStream).getEncoding());
    try {
        return new CSVParser(new InputStreamReader(inputStream), CSVFormat.DEFAULT
                .withFirstRecordAsHeader()
                .withIgnoreHeaderCase()
                .withSkipHeaderRecord()
                .withTrim());
    } catch (IOException e) {
        throw new IPRSException(e);
    }
}
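One way to make the problem visible (a debugging sketch, not part of the original post; record is assumed to be a CSVRecord produced by this parser) is to dump the code points of each header key, which exposes invisible characters:

import java.util.stream.Collectors;

// Print each header key together with its code points in hex; a key that
// "looks" like Name but prints "feff 4e 61 6d 65" starts with a UTF-8 BOM.
for (String key : record.toMap().keySet()) {
    String codePoints = key.chars()
            .mapToObj(Integer::toHexString)
            .collect(Collectors.joining(" "));
    System.out.println(key + " -> " + codePoints);
}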
There was a weird character in the strings (reference: Reading UTF-8 - BOM marker). Stripping the BOM from the header resolved the issue:
header = header.replace("\uFEFF", "");
In Java, use a UnicodeReader:
String path = "demo.csv";
CSVFormat.Builder builder = CSVFormat.RFC4180.builder();
CSVFormat format = builder.setQuote(null).setHeader().build();
// UnicodeReader detects and skips the BOM before the CSV parser sees it
InputStream in = new FileInputStream(new File(path));
CSVParser parser = new CSVParser(new BufferedReader(new UnicodeReader(in)), format);
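If you don't have a UnicodeReader implementation at hand, here is a minimal sketch using Apache Commons IO's BOMInputStream (assuming commons-io is on the classpath), which achieves the same thing:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.io.input.BOMInputStream;

// BOMInputStream silently consumes a leading UTF-8 BOM if present, so the
// first header name no longer starts with "\uFEFF".
InputStream in = new FileInputStream("demo.csv");
CSVParser parser = new CSVParser(
        new InputStreamReader(new BOMInputStream(in), StandardCharsets.UTF_8),
        CSVFormat.DEFAULT.withFirstRecordAsHeader().withTrim());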
I am trying to read an Excel file, manipulate it or add new data to it, and write it back out. I am also trying to do this as a completely reactive process using Flux and Mono. The idea is to return the resulting file or byte array via a web service.
My question is: how do I get an InputStream and OutputStream in a non-blocking way?
I am using the Apache POI library to read and generate the Excel file.
I currently have a solution based around a mix of Mono.fromCallable() and blocking code getting the InputStream.
For example, the web service part is as follows:
@GetMapping(value = API_BASE_PATH + "/download", produces = "application/vnd.ms-excel")
public Mono<ByteArrayResource> download() {
    Flux<TimeKeepingEntry> createExcel = excelExport.createDocument(false);
    return createExcel.then(Mono.fromCallable(() -> {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        excelExport.getWb().write(outputStream);
        return new ByteArrayResource(outputStream.toByteArray());
    }).subscribeOn(Schedulers.elastic()));
}
And the processing of the file:
public Flux<TimeKeepingEntry> createDocument(boolean all) {
    Flux<TimeKeepingEntry> entries = null;
    try {
        InputStream inputStream = new ClassPathResource("Timesheet Template.xlsx").getInputStream();
        wb = WorkbookFactory.create(inputStream);
        Sheet sheet = wb.getSheetAt(0);
        log.info("Created document");
        if (all) {
            //all entries
        } else {
            entries = service.findByMonth(currentMonthName)
                    .log("Excel Export - retrievedMonths")
                    .sort(Comparator.comparing(TimeKeepingEntry::getDateOfMonth))
                    .doOnNext(timeKeepingEntry -> this.populateEntry(sheet, timeKeepingEntry));
        }
    } catch (IOException e) {
        log.error("Error Importing File", e);
    }
    return entries;
}
This works well enough, but it is not very much in line with Flux and Mono. Some guidance here would be good; I would prefer to have the whole sequence non-blocking.
Unfortunately the WorkbookFactory.create() operation is blocking, so you have to perform that operation using imperative code. However, fetching each timeKeepingEntry can be done reactively. Your code would look something like this:
public Flux<TimeKeepingEntry> createDocument() {
    return Flux.generate(
            this::getWorkbookSheet,
            (sheet, sink) -> {
                sink.next(getNextTimeKeepingEntryFrom(sheet));
                return sheet; // the generator BiFunction must return the state
            },
            this::closeWorkbook);
}
This will keep the workbook in memory, but will fetch each entry on demand when the elements of the Flux are requested.
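To keep that imperative step off the event loop, one option (a sketch under my own assumptions, not from the question's code) is to wrap the blocking WorkbookFactory.create() call in Mono.fromCallable and push it onto an elastic scheduler, the same pattern the download() method above already uses:

import java.io.InputStream;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.springframework.core.io.ClassPathResource;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

// The blocking file read and workbook parsing are isolated in the callable,
// which runs on a scheduler intended for blocking work.
public Mono<Workbook> openWorkbook() {
    return Mono.fromCallable(() -> {
        InputStream in = new ClassPathResource("Timesheet Template.xlsx").getInputStream();
        return WorkbookFactory.create(in);
    }).subscribeOn(Schedulers.elastic());
}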
I am trying to index PDF files in Elasticsearch 6.3.2 using Java code. So far I have written the following code to save the PDF in ES. The code is working fine and I am able to save the Base64-encoded string of my PDF in ES. I want to understand whether the approach I am following is correct or not. Is there a better way of doing it?
Following is my code:
InputStream inputStream = new FileInputStream(new File("mypdf.pdf"));
try {
byte[] fileByteStream = IOUtils.toByteArray(inputStream );
String base64String = new String(Base64.getEncoder().encodeToString(fileByteStream).getBytes(),"UTF-8");
String strEncoded = Base64.getEncoder().encodeToString( base64String.getBytes( "utf-8" ));
this.stream.close();
JSONObject correspondenceNode = new JSONObject();
correspondenceNode.put("data",strEncoded );
String strSsonValues = correspondenceNode.toString();
HttpEntity entity = new NStringEntity(strSsonValues , ContentType.APPLICATION_JSON);
elasticrestClient.put("/2018/documents/"1, entity);
} catch (IOException e) {
e.printStackTrace();
}
Basically what I am doing here is converting the PDF document into a Base64 string, saving it into ES, and converting it back while reading.
Following is the code for decoding:
String responseBody = elasticrestClient.get("/2018/documents/1");
// some code to fetch the hits
JSONObject h = hitsArray.getJSONObject(0);
JSONObject source = h.getJSONObject("_source");
String object = source.getString("data");
// decode twice, mirroring the double encoding above
byte[] decodedStr = Base64.getDecoder().decode(object);
FileOutputStream fos = new FileOutputStream("download.pdf");
fos.write(Base64.getDecoder().decode(new String(decodedStr, "utf-8")));
fos.close();
This might be correct for storing BASE64 content in Elasticsearch, but a few pieces might be missing here:
You are not "indexing" the PDF per se in Elasticsearch. If you want to do so, you need to define an ingest pipeline and use the ingest attachment plugin to extract the content from the PDF.
You did not mention the mapping you are using. If you really want to keep the binary content around, you might want to define the BASE64 field as a binary data type.
It does not sound like a good idea to me to use Elasticsearch to store large blobs like this.
Instead, I'd extract text and metadata and index that plus a URL to the binary itself. Like:
{
  "content": "Extracted text here",
  "meta": {
    // Meta data there
  },
  "url": "file://path/to/file"
}
You can also look at FSCrawler (including its code) which does basically that.
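For the first point, here is a rough sketch of what such a pipeline could look like (it assumes the ingest-attachment plugin is installed on the cluster, and it reuses the question's elasticrestClient wrapper, whose signature I am guessing from the code above):

// Define an ingest pipeline whose attachment processor extracts the text
// from the BASE64 "data" field at index time.
String pipeline = "{"
        + "\"description\": \"Extract PDF content\","
        + "\"processors\": [ { \"attachment\": { \"field\": \"data\" } } ]"
        + "}";
elasticrestClient.put("/_ingest/pipeline/attachment",
        new NStringEntity(pipeline, ContentType.APPLICATION_JSON));

// Index the document through the pipeline so the extracted text is searchable.
elasticrestClient.put("/2018/documents/1?pipeline=attachment", entity);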
I am new to Spring Boot. I have this emailprop.properties in src/main/resources:
# your private key
mail.smtp.dkim.privatekey=classpath:/emailproperties/private.key.der
But I am getting this error:
classpath:\email properties\private.key.der (The filename, directory
name, or volume label syntax is incorrect)
How do I properly load this file?
Update-1
My Java code is:
dkimSigner = new DKIMSigner(emailProps.getProperty("mail.smtp.dkim.signingdomain"), emailProps.getProperty("mail.smtp.dkim.selector"),
        emailProps.getProperty("mail.smtp.dkim.privatekey"));
It works if I pass the hard-coded path "D:\\WorkShop\\MyDemoProj\\EmailService\\src\\main\\resources\\private.key.der" instead of emailProps.getProperty("mail.smtp.dkim.privatekey").
Update-2
I have tried this Java code:
String data = "";
ClassPathResource cpr = new ClassPathResource("private.key.der");
try {
    byte[] bdata = FileCopyUtils.copyToByteArray(cpr.getInputStream());
    data = new String(bdata, StandardCharsets.UTF_8);
} catch (IOException e) {
    e.printStackTrace();
}
dkimSigner = new DKIMSigner(emailProps.getProperty("mail.smtp.dkim.signingdomain"), emailProps.getProperty("mail.smtp.dkim.selector"), data);
The error is: java.io.FileNotFoundException: class path resource [classpath:private.key.der] cannot be resolved to URL because it does not exist
I also tried this code:
ClassPathResource resource = new ClassPathResource(emailProps.getProperty("mail.smtp.dkim.privatekey"));
File file = resource.getFile();
String absolutePath = file.getAbsolutePath();
Still the same error.
Please update the answer.
If you want to load this file at runtime, then you need to use a ResourceLoader; please have a look here for the documentation, section 8.4.
Resource resource = resourceLoader.getResource("classpath:/emailproperties/private.key.der");
Now, if you want to keep this exact path in the properties file, you can keep it there and then load it in your autowired constructor/field like this:
@Value("${mail.smtp.dkim.privatekey}") String pathToPrivateKey
and then pass this to the resource loader.
You can find a full example here; I don't want to copy-paste it.
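Putting the two pieces together, a minimal sketch (the class and method names are mine, not from the question) could look like this:

import java.io.IOException;
import java.io.InputStream;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.Resource;
import org.springframework.core.io.ResourceLoader;
import org.springframework.stereotype.Component;
import org.springframework.util.FileCopyUtils;

@Component
public class DkimKeyLoader {

    // e.g. classpath:/emailproperties/private.key.der from emailprop.properties
    @Value("${mail.smtp.dkim.privatekey}")
    private String privateKeyLocation;

    @Autowired
    private ResourceLoader resourceLoader;

    public byte[] loadKey() throws IOException {
        Resource resource = resourceLoader.getResource(privateKeyLocation);
        try (InputStream in = resource.getInputStream()) {
            return FileCopyUtils.copyToByteArray(in);
        }
    }
}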
If your file is located here:
"D:\\WorkShop\\MyDemoProj\\EmailService\\src\\main\\resources\\private.key.der"
then it should be:
mail.smtp.dkim.privatekey=classpath:private.key.der
EDIT:
I see now that you are using DKIMSigner, which expects a file-path string. Try changing your code like this:
ClassPathResource resource = new ClassPathResource(emailProps.getProperty("mail.smtp.dkim.privatekey"));
File file = resource.getFile();
String absolutePath = file.getAbsolutePath();
dkimSigner = new DKIMSigner(emailProps.getProperty("mail.smtp.dkim.signingdomain"), emailProps.getProperty("mail.smtp.dkim.selector"), absolutePath);
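One caveat worth noting (my addition, not part of the original answer): resource.getFile() only works when the application runs from an exploded directory; inside a packaged jar there is no real file on disk. A sketch that copies the resource to a temporary file first avoids that:

import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import org.springframework.core.io.ClassPathResource;

// Copy the classpath resource to a temp file so DKIMSigner gets a real path,
// even when the key is packaged inside the jar.
ClassPathResource resource = new ClassPathResource("private.key.der");
File tempKey = File.createTempFile("dkim-", ".der");
tempKey.deleteOnExit();
try (InputStream in = resource.getInputStream()) {
    Files.copy(in, tempKey.toPath(), StandardCopyOption.REPLACE_EXISTING);
}
dkimSigner = new DKIMSigner(
        emailProps.getProperty("mail.smtp.dkim.signingdomain"),
        emailProps.getProperty("mail.smtp.dkim.selector"),
        tempKey.getAbsolutePath());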
I am trying to replace an existing Community file using the following Java code:
Map<String, String> paramsMap = new HashMap<String, String>();
paramsMap.put("createVersion", "false");
fileEntry = fileService.updateCommunityFile(fis, fileUuid, fileName, communityLibraryId, paramsMap);
But it is returning an HTTP 411: Length Required error.
I am using the latest build (1.1.5.20150520-1200.jar).
Does anyone have a suggestion as to what I am missing?
I tried recreating the issue, but I am able to upload a new version of a Community file correctly, with and without versioning, using the updateCommunityFile API. I do not get any length-related error. This is the snippet I am using:
java.io.File file = new java.io.File("C://TestUploadCommunity.txt");
FileInputStream fis = null;
try {
    fis = new FileInputStream(file);
} catch (Exception e) {
    //TODO
}
fileEntry = fileService.updateCommunityFile(fis, fileEntry.getFileId(), fileEntry.getLabel(), communityLibraryId, params);
Can you share more details on your sample? What exactly is your fis?
I have tried this on 2 environments and I do not see any issue.
Also, from the entry you have pasted,
"Request to url apps.na.collabserv.com/files/basic/api/library... /document/... /entry?content-length=6600&createVersion=false returned an error response 411:Length Required HTTP/1.1 411"
it seems that somehow an incorrect content-length is being passed with your request.
Can you share the sample that you are using?
I have zip files that I would like to open 'through' Spark. I can open .gzip files with no problem because of Hadoop's native codec support, but I am unable to do so with .zip files.
Is there an easy way to read a zip file in your Spark code? I've also searched for zip codec implementations to add to the CompressionCodecFactory, but I have been unsuccessful so far.
There was no solution with Python code, and I recently had to read zips in PySpark. While searching for how to do that, I came across this question. So hopefully this will help others.
import zipfile
import io

def zip_extract(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))

zips = sc.binaryFiles("hdfs:/Testing/*.zip")
files_data = zips.map(zip_extract).collect()
In the above code I returned a dictionary with the filename in the zip as a key and the text data in each file as the value. You can change it however you want to suit your purposes.
@user3591785 pointed me in the correct direction, so I marked his answer as correct.
For a bit more detail, I was able to search for ZipFileInputFormat Hadoop, and came across this link: http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
Taking the ZipFileInputFormat and its helper ZipfileRecordReader class, I was able to get Spark to perfectly open and read the zip file.
rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP", ZipFileInputFormat.class, Text.class, Text.class, new Job().getConfiguration());
The result was a map with one element: the file name as key and the content as the value, so I needed to transform this into a JavaPairRDD. I'm sure you could replace Text with BytesWritable if you want, and replace the ArrayList with something else, but my goal was to first get something running.
JavaPairRDD<String, String> rdd2 = rdd1.flatMapToPair(new PairFlatMapFunction<Tuple2<Text, Text>, String, String>() {
    @Override
    public Iterable<Tuple2<String, String>> call(Tuple2<Text, Text> textTextTuple2) throws Exception {
        List<Tuple2<String, String>> newList = new ArrayList<Tuple2<String, String>>();
        InputStream is = new ByteArrayInputStream(textTextTuple2._2.getBytes());
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        String line;
        while ((line = br.readLine()) != null) {
            Tuple2 newTuple = new Tuple2(line.split("\\t")[0], line);
            newList.add(newTuple);
        }
        return newList;
    }
});
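A quick usage check (a sketch, assuming the rdd2 from above) to confirm the zip contents actually came through:

// Materialize a few parsed pairs and print them.
for (Tuple2<String, String> t : rdd2.take(5)) {
    System.out.println(t._1 + " -> " + t._2);
}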
Please try the code below, using the API:
sparkContext.newAPIHadoopRDD(
    hadoopConf,
    InputFormat.class,
    ImmutableBytesWritable.class,
    Result.class)
I've had a similar issue and I've solved it with the following code:
sparkContext.binaryFiles("/pathToZipFiles/*")
.flatMap { case (zipFilePath, zipContent) =>
val zipInputStream = new ZipInputStream(zipContent.open())
Stream.continually(zipInputStream.getNextEntry)
.takeWhile(_ != null)
.flatMap { zipEntry => ??? }
}
This answer only collects the previous knowledge, and I share my experience.
ZipFileInputFormat
I tried following @Tinku's and @JeffLL's answers and used the imported ZipFileInputFormat together with the sc.newAPIHadoopFile API. But this did not work for me, and I do not know how I would put the com-cotdp-hadoop lib on my production cluster; I am not responsible for the setup.
ZipInputStream
@Tiago Palma gave good advice, but he did not finish his answer, and I struggled for quite some time to actually get the decompressed output.
By the time I was able to do so, I had to prepare all the theoretical aspects, which you can find in my answer: https://stackoverflow.com/a/45958182/1549135
But the missing part of the mentioned answer is reading the ZipEntry:
import java.util.zip.ZipInputStream
import java.io.BufferedReader
import java.io.InputStreamReader
import org.apache.spark.input.PortableDataStream

sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open)
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val br = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(br.readLine()).takeWhile(_ != null)
      }
  }
Using the API sparkContext.newAPIHadoopRDD(hadoopConf, InputFormat.class, ImmutableBytesWritable.class, Result.class), the file name should be passed using the conf:
conf = new Job().getConfiguration()
conf.set(PROPERTY_NAME from your input formatter, "Zip file address")
sparkContext.newAPIHadoopRDD(conf, ZipFileInputFormat.class, Text.class, Text.class)
Please find the PROPERTY_NAME in your input formatter for setting the path.
Try:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.text("yourGzFile.gz")