Does super CSV support parsing a file header? - supercsv

I have a CSV that has a few file header lines at the top. The rest of the rows are in the normal tabular format. Is it possible to parse the header rows or process them differently from the remainder of the normal tabular data?

You can get the headers separately quite simply: they are on line 1, which makes them easy to fetch. Here is an example:
ICsvListReader listReader = new CsvListReader(new FileReader(CSV_FILENAME), CsvPreference.STANDARD_PREFERENCE);
final CellProcessor[] processors = getProcessors();
List<Object> customerList;
while( (customerList = listReader.read(processors)) != null ) {
    System.out.println(String.format("lineNo=%s, rowNo=%s, customerList=%s",
            listReader.getLineNumber(), listReader.getRowNumber(), customerList));
    if( listReader.getRowNumber() == 1 ) {
        // do whatever you need with the headers...
    }
}
listReader.close();
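If the header is a single line, Super CSV can also hand it to you before the data loop via getHeader(true) (and calling getHeader(false) again should let you consume further header lines if there is more than one). A minimal sketch, reusing the same CSV_FILENAME and getProcessors() as above:

ICsvListReader listReader = new CsvListReader(new FileReader(CSV_FILENAME), CsvPreference.STANDARD_PREFERENCE);
try {
    // reads line 1; the 'true' asserts that nothing has been read yet
    final String[] header = listReader.getHeader(true);
    // ... do whatever you need with the header values ...
    final CellProcessor[] processors = getProcessors();
    List<Object> customerList;
    while( (customerList = listReader.read(processors)) != null ) {
        // only normal tabular rows arrive here
    }
} finally {
    listReader.close();
}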

Related

How to substitute a particular field of a FlowFile with a field from another FlowFile?

I have the output FlowFile #1 of ExecuteScript and the output FlowFile #2 of another ExecuteScript.
FlowFile #1
{
"field1": "val",
"field2": "val"
}
FlowFile #2
{
"field2": "abc"
}
Which processor should I use in order to substitute the value val of field1 in FlowFile #1 with the value abc of field2 from FlowFile #2?
I don't want to use MergeContent, because what I need is just to replace the value.
UPDATE:
In UpdateAttribute I set the property filename equal to ${UUID()}. Then, in an ExecuteScript processor named "Merge inputs into single FlowFile", I use the code shown below. The files are not merged and just stay queued.
The output of ReplaceText looks like FlowFile #2, and the output of UpdateAttribute looks like FlowFile #1.
import org.apache.nifi.flowfile.FlowFile
import org.apache.nifi.processor.FlowFileFilter
import org.apache.nifi.processor.io.OutputStreamCallback
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

//get the first flow file
def ff0 = session.get()
if(!ff0) return

def filename = ff0.getAttribute('filename')

//try to find files with the same attribute in the incoming queue
def ffList = session.get(new FlowFileFilter(){
    public FlowFileFilterResult filter(FlowFile ff) {
        if( filename == ff.getAttribute('filename') ) return FlowFileFilterResult.ACCEPT_AND_CONTINUE
        return FlowFileFilterResult.REJECT_AND_CONTINUE
    }
})

//assuming you need one additional file with the same attribute in the queue
if( !ffList || ffList.size() < 1 ){
    session.rollback(true)
    return
}

//put everything in one list to simplify later iterations
ffList.add(ff0)
if( ffList.size() > 2 ){
    session.transfer(ffList, REL_FAILURE)
    return
}

//create an empty map (aka json object)
def json = [:]

//iterate through the files, parse their content and merge it
ffList.each{ ff ->
    session.read(ff).withStream{ rawIn ->
        def fjson = new JsonSlurper().parse(rawIn)
        json.putAll(fjson)
    }
}

//create a new flow file and write the merged json as its content
def ffOut = session.create()
ffOut = session.write(ffOut, { rawOut ->
    rawOut.withWriter("UTF-8"){ writer ->
        new JsonBuilder(json).writeTo(writer)
    }
} as OutputStreamCallback )

//set the mime-type
ffOut = session.putAttribute(ffOut, "mime.type", "application/json")

session.remove(ffList)
session.transfer(ffOut, REL_SUCCESS)
UPDATE #2:
The full details can be found in the chat, thanks to @dagget, but to ensure this question does not seem unanswered I will post the key points here.
Ensure that an id is added to the flow before it diverges; this allows you to bring the relevant message pairs back together again.
Disable the rollback penalty (i.e. use session.rollback(false)) to avoid waiting too long for both halves of a message to become available again.

Talend: Save variable for later use

I'm trying to save a value from the spreadsheet's header for later use as a new column value.
This is the reduced version, with the value (XYZ) in the header:
The value in the header must be used for the new column CODE:
This is my design:
tFilterRow_1 is used to reject rows without values in the A, B, C columns.
There is a conditional in tJavaRow_1 to set a global variable:
if (String.valueOf(row1.col_a).equals("CODE:")) {
    globalMap.putIfAbsent("code", row1.col_b);
}
The Var expression in tMap_1 to get the global variable is:
(String)globalMap.get("code")
The Var "code" is mapped to column "code" but I'm getting this output:
a1|b1|c1|
a2|b2|c2|
a3|b3|c3|
What am I missing, or is there a better approach to accomplish this scenario?
Thanks in advance.
Short answer: in tJavaRow, use input_row (or the actual incoming row name, in this case row4) instead of row1.
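For example, the conditional from the question would look roughly like this inside tJavaRow (a sketch only; it assumes the row coming into tJavaRow_1 still exposes the col_a and col_b columns, and that the component already copies input_row to output_row as needed):

if (String.valueOf(input_row.col_a).equals("CODE:")) {
    globalMap.putIfAbsent("code", input_row.col_b);
}
// ...plus whatever output_row assignments the component already has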
Longer answer: how I'd do it.
I'd let the Excel data flow in as-is. With a few Java tricks we can simply skip the first few rows and then let the rest of the flow go through.
So the filter + tJavaRow combo can be replaced with a tJavaFlex.
In the tJavaFlex I'd do the following.
In the begin part:

boolean contentFound = false;

In the main part:

if (input_row.col1 != null && input_row.col1.equalsIgnoreCase("Code:")) {
    globalMap.put("code", input_row.col2);
}
if (input_row.col1 != null && input_row.col1.equalsIgnoreCase("Column A:")) {
    contentFound = true;
} else {
    if (!contentFound) continue;
}

This way you'll simply skip the first few records (i.e. the header) and only care about the actual data.
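Outside of Talend, the same idea looks roughly like this in plain Java (a sketch only; the "CODE:" / "Column A:" markers, the pipe-separated output and the appended code column are taken from the example above, and the hard-coded rows stand in for the spreadsheet):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderSkipSketch {

    public static void main(String[] args) {
        // each row is {colA, colB, colC}, as in the reduced example
        List<String[]> rows = Arrays.asList(
                new String[]{"CODE:", "XYZ", ""},
                new String[]{"Column A:", "Column B:", "Column C:"},
                new String[]{"a1", "b1", "c1"},
                new String[]{"a2", "b2", "c2"},
                new String[]{"a3", "b3", "c3"});

        String code = null;
        boolean contentFound = false;
        List<String[]> data = new ArrayList<>();

        for (String[] row : rows) {
            if ("CODE:".equalsIgnoreCase(row[0])) {
                code = row[1];           // remember the header value
            }
            if ("Column A:".equalsIgnoreCase(row[0])) {
                contentFound = true;     // actual data starts after this line
                continue;
            }
            if (!contentFound) {
                continue;                // still inside the header block
            }
            data.add(new String[]{row[0], row[1], row[2], code});
        }

        for (String[] row : data) {
            System.out.println(String.join("|", row));  // e.g. a1|b1|c1|XYZ
        }
    }
}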

Java 8 Streams Filter a list based on a condition

I am trying to extract a filtered list from the original list based on some condition. I am using the backport version of Java 8 streams and am not quite sure how to do this. I get the Set from the ccarReport.getCcarReportWorkflowInstances() call. I need to iterate over this set and filter it based on a condition match (I am comparing the date attribute in each object with the request date being passed). Below is the code:
Set<CcarReportWorkflowInstance> ccarReportWorkflowInstanceSet = ccarReport.getCcarReportWorkflowInstances();
List<CcarReportWorkflowInstance> ccarReportWorkflowInstances = StreamSupport.stream(ccarReportWorkflowInstanceSet).filter(ccarReportWorkflowInstance -> DateUtils.isSameDay(cobDate, ccarReportWorkflowInstance.getCobDate()));
The routine which is doing the job:

public List<CcarRepWfInstDTO> fetchReportInstances(Long reportId, Date cobDate) {
    List<CcarRepWfInstDTO> ccarRepWfInstDTOs = null;
    CcarReport ccarReport = validateInstanceSearchParams(reportId, cobDate);
    Set<CcarReportWorkflowInstance> ccarReportWorkflowInstanceSet = ccarReport.getCcarReportWorkflowInstances();
    List<CcarReportWorkflowInstance> ccarReportWorkflowInstances = StreamSupport.stream(ccarReportWorkflowInstanceSet)
            .filter(ccarReportWorkflowInstance -> DateUtils.isSameDay(cobDate, ccarReportWorkflowInstance.getCobDate()));
    ccarRepWfInstDTOs = ccarRepWfInstMapper.ccarRepWfInstsToCcarRepWfInstDTOs(ccarReportWorkflowInstances);
    return ccarRepWfInstDTOs;
}
This is the error I get when I try to use streams:
Assuming I understood what you are trying to do, you can replace your method body with a single line :
return validateInstanceSearchParams(reportId, cobDate).getCcarReportWorkflowInstances()
        .stream()
        .filter(c -> DateUtils.isSameDay(cobDate, c.getCobDate()))
        .collect(Collectors.toList());
You can obtain a Stream from the Set by using the stream() method. No need for StreamSupport.stream().
After filtering the Stream, you should collect it into the output List.
I'd use shorter variable and method names. Your code is painful to read.
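If the DTO mapping step from your original method still needs to happen, a sketch of the whole method under that assumption (reusing the names from the question) could look like this:

public List<CcarRepWfInstDTO> fetchReportInstances(Long reportId, Date cobDate) {
    CcarReport ccarReport = validateInstanceSearchParams(reportId, cobDate);
    // filter the workflow instances down to the requested cob date
    List<CcarReportWorkflowInstance> matching = ccarReport.getCcarReportWorkflowInstances()
            .stream()
            .filter(c -> DateUtils.isSameDay(cobDate, c.getCobDate()))
            .collect(Collectors.toList());
    // then map the filtered entities to DTOs, as before
    return ccarRepWfInstMapper.ccarRepWfInstsToCcarRepWfInstDTOs(matching);
}

If you have to stay on the streamsupport backport rather than plain Java 8, keep the StreamSupport.stream(ccarReportWorkflowInstanceSet) call from your own snippet; the missing piece there is simply the terminal .collect(Collectors.toList()) step.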

Multiple Insert into Postgresql using Criteria is very slow

I am reading a big text file using Java. The file has 5,000,000 rows and each one has 3 columns. The file size is 350 MB.
For each row, I read it, I create an object using Criteria on Maven and I store it into a Postgresql database with a session.saveOrUpdate(object) command.
In the database I have a table with a serial ID and three attributes where I store the three columns of the file.
At the beginning the process runs "fast" (35,000 records in 30 min), but it keeps getting slower and the time to finish grows exponentially. How can I improve the process?
I have tried to split the big file into several smaller files, but it is almost as slow.
Many thanks in advance!
PS: The code:
public void process(){
    File archivo = null;
    FileReader fr = null;
    BufferedReader br = null;
    String linea;
    String[] columna;
    try{
        archivo = new File("/home/josealopez/Escritorio/file.txt");
        fr = new FileReader(archivo);
        br = new BufferedReader(fr);
        while((linea = br.readLine()) != null){
            columna = linea.split(";");
            saveIntoBBDD(columna[0], columna[1], columna[2]);
        }
    }
    catch(Exception e){
        e.printStackTrace();
    }
    finally{
        try{
            if( null != fr ){
                fr.close();
            }
        }
        catch(Exception e2){
            e2.printStackTrace();
        }
    }
}

@CommitAfter
public void saveIntoBBDD(String lon, String lat, String met){
    // "Object" stands in for the actual mapped entity class here
    Object b = new Object();
    b.setLon(Double.parseDouble(lon));
    b.setLat(Double.parseDouble(lat));
    b.setMeters(Double.parseDouble(met));
    session.saveOrUpdate(b);
}
You should focus on running this as a bulk process; the line-by-line processing is your issue here. PostgreSQL has a built-in command for bulk file loading, named COPY, that can deal with comma-separated and tab-separated files. The delimiter, quotation characters and many other settings are of course customizable.
Please check the official PostgreSQL documentation on populating a database and the details of the COPY command.
In this answer I provided a small example of how I do similar things.
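If the load has to be driven from Java, the PostgreSQL JDBC driver exposes COPY through its CopyManager API. Below is a minimal sketch under some assumptions: the target table is called measurements(lon, lat, meters) (the question does not name it), the connection URL and credentials are placeholders, and the input is the semicolon-delimited file from the question.

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class BulkCopyExample {

    public static void main(String[] args) throws Exception {
        // placeholder connection details
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             BufferedReader reader = new BufferedReader(
                     new FileReader("/home/josealopez/Escritorio/file.txt"))) {

            // CopyManager streams the whole file into the table in one bulk operation
            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
            long rows = copyManager.copyIn(
                    "COPY measurements (lon, lat, meters) FROM STDIN WITH (FORMAT csv, DELIMITER ';')",
                    reader);

            System.out.println("Copied " + rows + " rows");
        }
    }
}

Compared to saving entities one by one through the ORM session, this keeps the insert on the server side and avoids the per-row overhead (and the growing first-level cache) that typically makes such loads slow down over time.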

Read data from txt files - using Linq-Entities?

In my current ASP.NET MVC 3.0 project I am stuck with a situation.
I have four .txt files, each of which has approximately 100k rows of records.
These files will be replaced with new files on a weekly basis.
I need to query data from these four text files, and I am not able to decide on the best and most efficient way to do this.
Three ways I could think of:
1. Convert these text files to XML on a weekly basis and query it with LINQ to XML
2. Run a batch import weekly from txt to SQL Server and query using LINQ to Entities
3. Avoid all conversions and query directly from the text files.
Can anyone suggest the best way to deal with this situation?
Update:
URL of the text file
I have to connect to this file with credentials.
Once I connect successfully, I will have the text file as below, with a pipe (|) as the delimiter.
This is the text file
Now I have to look up the field highlighted in yellow and get the data in that row.
Note: the first two lines of the text file are headers of the file.
Well, I found a way myself. Hope this will be useful for anyone who is interested in getting this done.
string url = "https://myurl.com/folder/file.txt";
WebClient request = new WebClient();
request.Credentials = new NetworkCredential(ConfigurationManager.AppSettings["UserName"], ConfigurationManager.AppSettings["Password"]);
Stream s = request.OpenRead(url);
using (StreamReader strReader = new StreamReader(s))
{
    // skip the two header lines
    for (int i = 0; i <= 1; i++)
        strReader.ReadLine();

    while (!strReader.EndOfStream)
    {
        var CurrentLine = strReader.ReadLine();
        var count = CurrentLine.Split('|').Count();
        if (count > 3 && CurrentLine.Split('|')[3].Equals("SearchString"))
        {
            #region Bind Data to Model
            //var Line = CurrentLine.Split('|');
            //CID.RecordType = Line[0];
            //CID.ChangeIdentifier = Line[1];
            //CID.CoverageID = Convert.ToInt32(Line[2]);
            //CID.NationalDrugCode = Line[3];
            //CID.DrugQualifier = Convert.ToInt32(Line[4]);
            #endregion
            break;
        }
    }
    s.Close();
}
request.Dispose();
