Stripping out headers from CSV file loading - Hadoop

I have a file that takes the following format.
TIMESTAMP=Jan 20 10:22:43 2014
TYPE=text
BEGIN-FILE
value1,value2,value3,value4,value5,value6
value1,value2,value3,value4,value5,value6
value1,value2,value3,value4,value5,value6
END-FILE
I want to load this file into Hadoop, but unfortunately all the files contain these lines of meta information that I want to strip out. Is there a way in Pig (or any other method) to ignore all lines that do not contain commas?

In Pig, you can use the FILTER command to remove those lines if you just want to throw them away. You could do this multiple ways; here are a couple of possibilities:
Load entire lines into a single field, filter out the ones which can't be split on comma into 6 fields, and then split them out for use in your script:
a = LOAD 'file' USING PigStorage('\n') AS (line:chararray);
b = FILTER a BY SIZE(STRSPLIT(line, ',', 6)) == 6;
c = FOREACH b GENERATE FLATTEN(STRSPLIT(line, ',', 6)) AS (/*put your schema here*/);
Load as comma-separated file, and then throw away any lines with NULL in the 6th field:
a = LOAD 'file' USING PigStorage(',') AS (/*put your schema here*/);
b = FILTER a BY $5 IS NOT NULL;

In MapReduce, have your mapper first try to parse the line it is reading. You can do this with custom parsing logic, or you can leverage pre-built code (in this case, a CSV library).
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    CSVParser parser = new au.com.bytecode.opencsv.CSVParser();
    try {
        String[] fields = parser.parseLine(line);
        if (fields.length == 6) {
            // a real data row: do your other stuff (e.g. context.write(...))
        }
        // otherwise it was a meta-information line; do nothing
    } catch (Exception e) {
        // the line could not be parsed; do nothing
    }
}

Related

Iterating over each row and accessing every column in HTTP Sampler in JMeter

I have explored and tried solutions found on Google and Stack Overflow but could not solve my problem.
I am trying to iterate over each row of a CSV and use every column of a row in an HTTP Sampler.
This is what I have tried so far (screenshots of my test plan structure, my CSV file, and my CSV Data Set Config omitted).
I am reading the entire CSV and storing the values in JMeter properties using a BeanShell Sampler.
This is the code in the BeanShell Sampler:
import java.text.*;
import java.io.*;

String filename = "load_test_date.csv";
ArrayList strList = new ArrayList();
try {
    log.info("starting bean shell");
    File file = new File(filename);
    if (!file.exists()) {
        throw new Exception("ERROR: file " + filename + " not found");
    }
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
    String line = null;
    log.info("while loop starting");
    headerLine = br.readLine();
    while ((line = br.readLine()) != null) {
        log.info(line);
        String[] variables = line.split(",");
        props.put("header_1", variables[0]);
        props.put("header_2", variables[1]);
        props.put("header_3", variables[2]);
        props.put("header_4", variables[3]);
        props.put("header_5", variables[4]);
    }
} catch (Exception ex) {
    log.error(ex.getMessage());
}
Now I want to iterate over the props variables and fetch each column. I tried using a While Controller and a ForEach Controller, but neither gives me the desired output.
With the While Controller (screenshot omitted), the loop executes twice (instead of three times for the three rows in the CSV file) and always uses the last row's values.
I used a ForEach Controller too but could not produce the desired outcome.
First of all, forget about Beanshell; since JMeter 3.1 you should be using JSR223 Test Elements and the Groovy language for scripting.
Second, if I understood you correctly and you want to iterate over all the values, i.e. from 1 to 15, you need a different approach: for example, read the whole file into memory, split each line on commas, and create a JMeter variable for each "cell" value. Example Groovy code would be something like:
SampleResult.setIgnore()

def lines = new File('load_test_date.csv').readLines()
def counter = 1
1.upto(lines.size() - 1, { index ->
    def line = lines.get(index)
    line.split(',').each { column ->
        vars.put('value_' + counter, column)
        counter++
    }
})
If you execute the script and look into the Debug Sampler output, you will see JMeter variables named value_1, value_2, value_3, and so on.
To iterate over the generated variables you can use a ForEach Controller (configuration screenshot omitted).
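Presumably it is configured with "Input variable prefix" set to value, "Output variable name" set to value, and "Add '_' before number?" checked, so that it walks value_1, value_2, and so on.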
Then use ${value} in the HTTP Request sampler to access the next "cell" value on each iteration.

In a JMeter BeanShell PreProcessor, is there any way to read the lines of a CSV data file and put them into an array?

In a JMeter BeanShell PreProcessor, is there any way to read the lines of a CSV data file and put them into an array?
The CSV file contains:
data1
date2
date2
I want to put all three values into an array and send them to the HTTP request in JMeter via a ForEach Controller.
Thanks in advance.
If you want the Beanshell solution:
BufferedReader reader = new BufferedReader(new FileReader("path.to.your.file.csv"));
int counter = 1;
for (String line; (line = reader.readLine()) != null; ) {
    vars.put("date" + counter, line);
    counter++;
}
However, I don't see any value added by Beanshell here; it is recommended to avoid scripting and use JMeter components where possible. If you need to send values from the CSV file consecutively, I would recommend using one of the following test elements instead (see the example after the list):
CSV Data Set Config
CSVRead or StringFromFile functions
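For example, with the __CSVRead function (a rough illustration, reusing the file name from the Beanshell snippet above): ${__CSVRead(path.to.your.file.csv,0)} returns the first (and here only) column of the current row, and ${__CSVRead(path.to.your.file.csv,next)} advances to the next row, so a ForEach or Loop Controller can step through the values without any scripting.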

Parsing several csv files using Spring Batch

I need to parse several CSV files from a given folder. As each CSV has different columns, there is a separate table in the DB for each CSV. I need to know:
Does Spring Batch provide any mechanism that scans through the given folder so that I can pass those files one by one to the reader?
As I am trying to make the reader/writer generic, is it possible to just get the column header for each CSV, so that based on it I can build the tokenizer and also the insert query?
Code sample
public ItemReader<Gdp> reader1() {
    FlatFileItemReader<Gdp> reader1 = new FlatFileItemReader<Gdp>();
    reader1.setResource(new ClassPathResource("datagdp.csv"));
    reader1.setLinesToSkip(1);
    reader1.setLineMapper(new DefaultLineMapper<Gdp>() {
        {
            setLineTokenizer(new DelimitedLineTokenizer() {
                {
                    setNames(new String[] { "region", "gdpExpend", "value" });
                }
            });
            setFieldSetMapper(new BeanWrapperFieldSetMapper<Gdp>() {
                {
                    setTargetType(Gdp.class);
                }
            });
        }
    });
    return reader1;
}
Use a MultiResourceItemReader to scan all the files.
I think you need a sort of classifier-aware ItemReader as the MultiResourceItemReader.delegate, but Spring Batch doesn't offer one, so you have to write your own.
For ItemProcessor and ItemWriter, Spring Batch does offer classifier-aware implementations (ClassifierCompositeItemProcessor and ClassifierCompositeItemWriter).
Obviously, the more different input files you have, the more XML config you must write, but it should be straightforward to do.
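As a rough illustration of the MultiResourceItemReader approach above (the folder path and delegate wiring are assumptions, not code from the question):

import java.io.IOException;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.MultiResourceItemReader;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

public MultiResourceItemReader<Gdp> multiReader() throws IOException {
    // scan the folder for all CSV files
    Resource[] csvFiles = new PathMatchingResourcePatternResolver()
            .getResources("file:/path/to/csv/folder/*.csv");
    MultiResourceItemReader<Gdp> multiReader = new MultiResourceItemReader<Gdp>();
    multiReader.setResources(csvFiles);
    // the delegate must be a resource-aware reader such as FlatFileItemReader;
    // configure its line mapper/tokenizer as in reader1() above
    FlatFileItemReader<Gdp> delegate = new FlatFileItemReader<Gdp>();
    multiReader.setDelegate(delegate);
    return multiReader;
}

Each matched file is then handed to the delegate in turn, so the rest of the step does not have to know how many files there are.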
I suppose you are expecting this kind of implementation.
In the partition step builder, read all the file names, each file's header, and the insert query for the writer, and save them in the ExecutionContext.
In the slave step, for every reader and writer, pass on the ExecutionContext to get the file to read, the file header for the tokenizer, and the insert query that the writer needs to execute.
This resolves your question.
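A minimal sketch of that partitioning idea (class and key names below are hypothetical, not from the original post): each CSV file gets its own ExecutionContext, which the slave step's reader and writer can then pick up.

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class CsvFilePartitioner implements Partitioner {
    private final File inputDir;

    public CsvFilePartitioner(File inputDir) {
        this.inputDir = inputDir;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<String, ExecutionContext>();
        int i = 0;
        for (File csv : inputDir.listFiles()) {
            ExecutionContext ctx = new ExecutionContext();
            ctx.putString("fileName", csv.getAbsolutePath());
            // the file header and the writer's insert query could be derived
            // here as well and stored in the same context
            partitions.put("partition" + i++, ctx);
        }
        return partitions;
    }
}

Each slave step can then read "fileName" (and whatever else you stored) from its step execution context.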
Answers for your questions:
I don't know of a specific mechanism in Spring Batch for scanning files.
You can use opencsv as a generic CSV reader; it offers a lot of ways of reading files.
About opencsv:
If you are using a Maven project, add this dependency:
<dependency>
    <groupId>net.sf.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>2.0</version>
</dependency>
You can read your files into an object for a specific format, or handle generic headers, like the example below:
private static List<PeopleData> extrairDadosPeople() throws IOException {
    // 'people' in the next line is the CSV file (or its path), defined elsewhere in the class
    CSVReader readerPeople = new CSVReader(new FileReader(people));
    List<PeopleData> listPeople = new ArrayList<PeopleData>();
    String[] nextLine;
    while ((nextLine = readerPeople.readNext()) != null) {
        PeopleData people = new PeopleData();
        people.setIncludeData(nextLine[0]);
        people.setPartnerCode(Long.valueOf(nextLine[1]));
        listPeople.add(people);
    }
    readerPeople.close();
    return listPeople;
}
There are a lot of other ways to read CSV files using opencsv:
If you want to use an Iterator style pattern, you might do something like this:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    // nextLine[] is an array of values from the line
    System.out.println(nextLine[0] + nextLine[1] + "etc...");
}
Or, if you just want to slurp the whole lot into a List, just call readAll()...
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
List myEntries = reader.readAll();
which will give you a List of String[] that you can iterate over. If all else fails, check out the Javadocs.
If you want to customize quote characters and separators, you'll find constructors that cater for supplying your own separator and quote characters. Say you're using a tab for your separator, you can do something like this:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), '\t');
And if you single quoted your escaped characters rather than double quote them, you can use the three arg constructor:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), '\t', '\'');
You may also skip the first few lines of the file if you know that the content doesn't start till later in the file. So, for example, you can skip the first two lines by doing:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), '\t', '\'', 2);
Can I write csv files with opencsv?
Yes. There is a CSVWriter in the same package that follows the same semantics as the CSVReader. For example, to write a tab separated file:
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"), '\t');
// feed in your array (or convert your data to an array)
String[] entries = "first#second#third".split("#");
writer.writeNext(entries);
writer.close();
If you'd prefer to use your own quote characters, you may use the three arg version of the constructor, which takes a quote character (or feel free to pass in CSVWriter.NO_QUOTE_CHARACTER).
You can also customise the line terminators used in the generated file (which is handy when you're exporting from your Linux web application to Windows clients). There is a constructor argument for this purpose.
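For illustration only (the exact constructor overloads can differ between opencsv versions), combining a custom quote character with a Windows-style line terminator might look like:
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"), ',', CSVWriter.NO_QUOTE_CHARACTER, "\r\n");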
Can I dump out SQL tables to CSV?
Yes you can. There is a feature on CSVWriter so you can pass writeAll() a ResultSet.
java.sql.ResultSet myResultSet = ....
writer.writeAll(myResultSet, includeHeaders);
Is there a way to bind my CSV file to a list of Javabeans?
Yes there is. There is a set of classes to allow you to bind a CSV file to a list of JavaBeans based on column name, column position, or a custom mapping strategy. You can find the new classes in the com.opencsv.bean package. Here's how you can map to a java bean based on the field positions in your CSV file:
ColumnPositionMappingStrategy strat = new ColumnPositionMappingStrategy();
strat.setType(YourOrderBean.class);
String[] columns = new String[] {"name", "orderNumber", "id"}; // the fields to bind to in your JavaBean
strat.setColumnMapping(columns);
CsvToBean csv = new CsvToBean();
List list = csv.parse(strat, yourReader);

How do I autoscan local documents to add words to a custom dictionary?

I'd like my dictionary to know more of the words I use - and don't want to manually add all possible words as I end up typing them (I'm a biologist/bioinformatician - there's lots of jargon and specific software and species names). Instead I want to:
Take a directory of existing documents. These are PDFs or Word/latex documents of scientific articles; I guess they could be "easily" be converted to plain text.
Pull out all words that are not in the "normal" dictionary.
Add these to my local custom dictionary (on my Mac that's ~/Library/Spelling/LocalDictionary), but it would make sense to add them to the LibreOffice/Word/ispell custom dictionaries as well.
1 and 3 are easy. How can I do 2? Thanks!
As far as I understand, you want to remove duplicates (words that already exist in the system dictionary). You might want to ask first whether this is really necessary, though. I guess they won't cause any problems and won't slow down spell-checking excessively, so there is no real need for step 2 in my opinion.
I think you'll have a much harder time with step 1. Extracting plain-text from a PDF may sound easy, but it certainly is not. You'll end up with plenty of unknown symbols. You need to fix split-words at the end of a line and you probably want to exclude equations/links/numbers/etc. before adding all these to your dictionary.
But if you have some tool to get this done and can create a couple of .txt files really containing only the words/sentences you need, then I would go with something like the following python code to "solve" the merge for your local dictionary only. Of course you can also extend this to load the system dictionary (wherever that is?) and merge it the same way I show below.
Please note that I left out any error handling on purpose.
Save as import_to_dict.py, adjust the paths to your requirements and call with python import_to_dict.py
#!/usr/bin/env python
import os, re

# 1 - load existing dictionaries from files (adjust paths here!)
local_dictionary_file = os.path.expanduser('~/Library/Spelling/LocalDictionary')
global_dictionary_file = '/Library/Spelling/GlobalDictionary'
txt_file_folder = os.path.expanduser('~/Documents/ConvertedPapers')

reg_exp = r'[\s,.|/]+'  # add symbols here

with open(local_dictionary_file, 'r') as f:
    # splitting with regular expressions shouldn't really be needed for the dictionary, but it should work
    dictionary = set(re.split(reg_exp, f.read()))

with open(global_dictionary_file, 'r') as f:
    # splitting with regular expressions shouldn't really be needed for the dictionary, but it should work
    global_dictionary = set(re.split(reg_exp, f.read()))

# 2 - walk over all sub-dirs in your folder
for root, dirs, files in os.walk(txt_file_folder):
    # open all files (this could easily be limited to only .txt files)
    for file in files:
        with open(os.path.join(root, file), 'r') as txt_f:
            # read the file contents
            words = txt_f.read()
            # split into word-set (set guarantees no duplicates)
            word_set = set(re.split(reg_exp, words))
            # remove any words already in the existing dictionaries
            missing_words = (word_set - dictionary) - global_dictionary
            # add missing words to dictionary
            dictionary |= missing_words

# 3 - write dictionary file
with open(local_dictionary_file, 'w') as f:
    f.write('\n'.join(dictionary))
Here is a basic Java program that will generate a text file containing all of the unique words in a directory of plain text files, separated by newlines.
You can just replace the input directory and output file path strings with correct values for your system and run it.
import java.io.*;
import java.util.*;

public class MakeDictionary {
    public static void main(String[] args) throws IOException {
        Hashtable<String, Boolean> dictionary = new Hashtable<String, Boolean>();
        String inputDir = "C:\\test";
        String outputFile = "C:\\out\\dictionary.txt";
        File[] files = new File(inputDir).listFiles();
        BufferedWriter out = new BufferedWriter(new FileWriter(outputFile));
        for (File file : files) {
            if (file.isFile()) {
                BufferedReader in = null;
                try {
                    in = new BufferedReader(new FileReader(file.getCanonicalPath()));
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] words = line.split(" ");
                        for (String word : words) {
                            dictionary.put(word, true);
                        }
                    }
                } finally {
                    if (in != null) {
                        in.close();
                    }
                }
            }
        }
        Set<String> wordset = dictionary.keySet();
        Iterator<String> iter = wordset.iterator();
        while (iter.hasNext()) {
            out.write(iter.next());
            out.newLine();
        }
        out.close();
    }
}

PIG Loading a CSV - Map Type Error

We aim to leverage Pig for large-scale analysis of our server logs. I need to load a Pig map datatype from a file.
I tried running a sample Pig script with the following data.
A line in my CSV file, named 'test' (to be processed by Pig), looks like:
151364,[ref#R813,highway#secondary]
My PIG Script
a = LOAD 'test' using PigStorage(',') AS (id:INT, m:MAP[]);
DUMP a;
The idea is to load an int and the second element as a map.
However, when I dump, the int field gets parsed correctly (and gets printed in the dump) but the map field is not parsed, resulting in a parsing error.
Can someone please explain what I am missing?
I think there is a delimiter-related problem (the field delimiter somehow affects parsing of the map field, or it is confused with the map's internal delimiter).
When this input data is used (notice I used a semicolon as the field delimiter):
151364;[ref#R813,highway#secondary]
below is the output from my grunt shell:
grunt> a = LOAD '/tmp/temp2.txt' using PigStorage(';') AS (id:int, m:[]);
grunt> dump a;
...
(151364,[highway#secondary,ref#R813])
grunt> b = foreach a generate m#'ref';
grunt> dump b;
(R813)
At last, I figured out the problem. Just change the field delimiter from ',' to another character, say a pipe. The field delimiter was being confused with the ',' delimiter used inside the map :)
The string 151364,[ref#R813,highway#secondary] was getting parsed into:
field1: 151364 field2: [ref#R813 field3: highway#secondary]
Since '[ref#R813' is not a valid map field, there is a parse error.
I also looked into the source code of the PigStorage function and confirmed the parsing logic; the relevant part of getNext() is below.
@Override
public Tuple getNext() throws IOException {
    for (int i = 0; i < len; i++) {
        // skipping some stuff
        if (buf[i] == fieldDel) { // if we find the delimiter
            readField(buf, start, i); // extract the field from the previous delimiter to the current one
            start = i + 1;
            fieldID++;
        }
    }
}
Thus, since Pig splits fields on the field delimiter, the parsing of the fields gets confused with the separator used inside the map.
