SimpleTextLoader UDF in Pig - hadoop

I want to create a custom load function (UDF) for Pig. I created a SimpleTextLoader using the link
https://pig.apache.org/docs/r0.11.0/udf.html, successfully generated the jar file for this code, registered it in Pig, and ran a Pig script. I am getting empty output and don't know how to solve this issue; any help would be appreciated.
Below is my Java code
public class SimpleTextLoader extends LoadFunc{
protected RecordReader in = null;
private byte fieldDel = '\t';
private ArrayList<Object> mProtoTuple = null;
private TupleFactory mTupleFactory = TupleFactory.getInstance();
private static final int BUFFER_SIZE = 1024;
public SimpleTextLoader() {
}
public SimpleTextLoader(String delimiter)
{
this();
if (delimiter.length() == 1) {
this.fieldDel = (byte)delimiter.charAt(0);
} else if (delimiter.length() > 1 && delimiter.charAt(0) == '\\') {
switch (delimiter.charAt(1)) {
case 't':
this.fieldDel = (byte)'\t';
break;
case 'x':
fieldDel =
Integer.valueOf(delimiter.substring(2), 16).byteValue();
break;
case 'u':
this.fieldDel =
Integer.valueOf(delimiter.substring(2)).byteValue();
break;
default:
throw new RuntimeException("Unknown delimiter " + delimiter);
}
} else {
throw new RuntimeException("PigStorage delimeter must be a single character");
}
}
private void readField(byte[] buf, int start, int end) {
if (mProtoTuple == null) {
mProtoTuple = new ArrayList<Object>();
}
if (start == end) {
// NULL value
mProtoTuple.add(null);
} else {
mProtoTuple.add(new DataByteArray(buf, start, end));
}
}
@Override
public Tuple getNext() throws IOException {
try {
boolean notDone = in.nextKeyValue();
if (notDone) {
return null;
}
Text value = (Text) in.getCurrentValue();
System.out.println("printing value" +value);
byte[] buf = value.getBytes();
int len = value.getLength();
int start = 0;
for (int i = 0; i < len; i++) {
if (buf[i] == fieldDel) {
readField(buf, start, i);
start = i + 1;
}
}
// pick up the last field
readField(buf, start, len);
Tuple t = mTupleFactory.newTupleNoCopy(mProtoTuple);
mProtoTuple = null;
System.out.println(t);
return t;
} catch (InterruptedException e) {
int errCode = 6018;
String errMsg = "Error while reading input";
throw new ExecException(errMsg, errCode,
PigException.REMOTE_ENVIRONMENT, e);
}
}
@Override
public void setLocation(String string, Job job) throws IOException {
FileInputFormat.setInputPaths(job,string);
}
@Override
public InputFormat getInputFormat() throws IOException {
return new TextInputFormat();
}
@Override
public void prepareToRead(RecordReader reader, PigSplit ps) throws IOException {
in=reader;
}
}
Below is my Pig Script
REGISTER /home/hadoop/netbeans/sampleloader/dist/sampleloader.jar
a= load '/input.txt' using sampleloader.SimpleTextLoader();
store a into 'output';

You are using sampleloader.SimpleTextLoader(), which doesn't do anything because it is just the empty constructor.
Instead, use sampleloader.SimpleTextLoader(String delimiter), which performs the actual splitting on the delimiter.
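For example, a minimal change to the script, assuming the input file is comma-delimited (adjust the delimiter to whatever actually separates your fields):
REGISTER /home/hadoop/netbeans/sampleloader/dist/sampleloader.jar
a = load '/input.txt' using sampleloader.SimpleTextLoader(',');
store a into 'output';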

Related

How to export huge result set from database into several csv files and zip them on the fly?

I need to create a REST controller which extracts data from a database and writes it into CSV files that will ultimately be zipped together. Each CSV file should contain exactly 10 lines. Eventually, all CSV files should be zipped into one zip file. I want everything to happen on the fly, meaning saving files to a temporary location on disk is not an option. Can someone provide me with an example?
I found a very nice piece of code to export a huge number of rows from the database into several CSV files and zip them.
I think this code can assist a lot of developers.
I have tested the solution, and you can find the entire example at: https://github.com/idaamit/stream-from-db/tree/master
The controller is:
@GetMapping(value = "/employees/{employeeId}/cars")
@ResponseStatus(HttpStatus.OK)
public ResponseEntity<StreamingResponseBody> getEmployeeCars(@PathVariable int employeeId) {
log.info("Going to export cars for employee {}", employeeId);
String zipFileName = "Cars Of Employee - " + employeeId;
return ResponseEntity.ok()
.header(HttpHeaders.CONTENT_TYPE, "application/zip")
.header(HttpHeaders.CONTENT_DISPOSITION, "attachment;filename=" + zipFileName + ".zip")
.body(
employee.getCars(dataSource, employeeId));
}
The Employee class first checks whether we need to prepare more than one CSV file or not:
public class Employee {
public StreamingResponseBody getCars(BasicDataSource dataSource, int employeeId) {
StreamingResponseBody streamingResponseBody = new StreamingResponseBody() {
@Override
public void writeTo(OutputStream outputStream) throws IOException {
JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
String sqlQuery = "SELECT [Id], [employeeId], [type], [text1] " +
"FROM Cars " +
"WHERE EmployeeID=? ";
PreparedStatementSetter preparedStatementSetter = new PreparedStatementSetter() {
public void setValues(PreparedStatement preparedStatement) throws SQLException {
preparedStatement.setInt(1, employeeId);
}
};
StreamingZipResultSetExtractor zipExtractor = new StreamingZipResultSetExtractor(outputStream, employeeId, isMoreThanOneFile(jdbcTemplate, employeeId));
Integer numberOfInteractionsSent = jdbcTemplate.query(sqlQuery, preparedStatementSetter, zipExtractor);
}
};
return streamingResponseBody;
}
private boolean isMoreThanOneFile(JdbcTemplate jdbcTemplate, int employeeId) {
Integer numberOfCars = getCount(jdbcTemplate, employeeId);
return numberOfCars >= StreamingZipResultSetExtractor.MAX_ROWS_IN_CSV;
}
private Integer getCount(JdbcTemplate jdbcTemplate, int employeeId) {
String sqlQuery = "SELECT count([Id]) " +
"FROM Cars " +
"WHERE EmployeeID=? ";
return jdbcTemplate.queryForObject(sqlQuery, new Object[] { employeeId }, Integer.class);
}
}
The StreamingZipResultSetExtractor class is responsible for splitting the streamed CSV data into several files and zipping them.
@Slf4j
public class StreamingZipResultSetExtractor implements ResultSetExtractor<Integer> {
private final static int CHUNK_SIZE = 100000;
public final static int MAX_ROWS_IN_CSV = 10;
private OutputStream outputStream;
private int employeeId;
private StreamingCsvResultSetExtractor streamingCsvResultSetExtractor;
private boolean isInteractionCountExceedsLimit;
private int fileCount = 0;
public StreamingZipResultSetExtractor(OutputStream outputStream, int employeeId, boolean isInteractionCountExceedsLimit) {
this.outputStream = outputStream;
this.employeeId = employeeId;
this.streamingCsvResultSetExtractor = new StreamingCsvResultSetExtractor(employeeId);
this.isInteractionCountExceedsLimit = isInteractionCountExceedsLimit;
}
@Override
@SneakyThrows
public Integer extractData(ResultSet resultSet) throws DataAccessException {
log.info("Creating thread to extract data as zip file for employeeId {}", employeeId);
int lineCount = 1; //+1 for header row
try (PipedOutputStream internalOutputStream = streamingCsvResultSetExtractor.extractData(resultSet);
PipedInputStream pipedInputStream = new PipedInputStream(internalOutputStream);
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(pipedInputStream))) {
String currentLine;
String header = bufferedReader.readLine() + "\n";
try (ZipOutputStream zipOutputStream = new ZipOutputStream(outputStream)) {
createFile(employeeId, zipOutputStream, header);
while ((currentLine = bufferedReader.readLine()) != null) {
if (lineCount % MAX_ROWS_IN_CSV == 0) {
zipOutputStream.closeEntry();
createFile(employeeId, zipOutputStream, header);
lineCount++;
}
lineCount++;
currentLine += "\n";
zipOutputStream.write(currentLine.getBytes());
if (lineCount % CHUNK_SIZE == 0) {
zipOutputStream.flush();
}
}
}
} catch (IOException e) {
log.error("Task {} could not zip search results", employeeId, e);
}
log.info("Finished zipping all lines to {} file\\s - total of {} lines of data for task {}", fileCount, lineCount - fileCount, employeeId);
return lineCount;
}
private void createFile(int employeeId, ZipOutputStream zipOutputStream, String header) {
String fileName = "Cars for Employee - " + employeeId;
if (isInteractionCountExceedsLimit) {
fileCount++;
fileName += " Part " + fileCount;
}
try {
zipOutputStream.putNextEntry(new ZipEntry(fileName + ".csv"));
zipOutputStream.write(header.getBytes());
} catch (IOException e) {
log.error("Could not create new zip entry for task {} ", employeeId, e);
}
}
}
The StreamingCsvResultSetExtractor class is responsible for transferring the data from the ResultSet into a CSV file. There is more work to do to handle special characters that are problematic in a CSV cell.
@Slf4j
public class StreamingCsvResultSetExtractor implements ResultSetExtractor<PipedOutputStream> {
private final static int CHUNK_SIZE = 100000;
private PipedOutputStream pipedOutputStream;
private final int employeeId;
public StreamingCsvResultSetExtractor(int employeeId) {
this.employeeId = employeeId;
}
@SneakyThrows
@Override
public PipedOutputStream extractData(ResultSet resultSet) throws DataAccessException {
log.info("Creating thread to extract data as csv and save to file for task {}", employeeId);
this.pipedOutputStream = new PipedOutputStream();
ExecutorService executor = Executors.newSingleThreadExecutor();
executor.submit(() -> {
prepareCsv(resultSet);
});
return pipedOutputStream;
}
@SneakyThrows
private Integer prepareCsv(ResultSet resultSet) {
int interactionsSent = 1;
log.info("starting to extract data to csv lines");
streamHeaders(resultSet.getMetaData());
StringBuilder csvRowBuilder = new StringBuilder();
try {
int columnCount = resultSet.getMetaData().getColumnCount();
while (resultSet.next()) {
for (int i = 1; i < columnCount + 1; i++) {
if(resultSet.getString(i) != null && resultSet.getString(i).contains(",")){
String strToAppend = "\"" + resultSet.getString(i) + "\"";
csvRowBuilder.append(strToAppend);
} else {
csvRowBuilder.append(resultSet.getString(i));
}
csvRowBuilder.append(",");
}
int rowLength = csvRowBuilder.length();
csvRowBuilder.replace(rowLength - 1, rowLength, "\n");
pipedOutputStream.write(csvRowBuilder.toString().getBytes());
interactionsSent++;
csvRowBuilder.setLength(0);
if (interactionsSent % CHUNK_SIZE == 0) {
pipedOutputStream.flush();
}
}
} finally {
pipedOutputStream.flush();
pipedOutputStream.close();
}
log.debug("Created all csv lines for Task {} - total of {} rows", employeeId, interactionsSent);
return interactionsSent;
}
@SneakyThrows
private void streamHeaders(ResultSetMetaData resultSetMetaData) {
StringBuilder headersCsvBuilder = new StringBuilder();
for (int i = 1; i < resultSetMetaData.getColumnCount() + 1; i++) {
headersCsvBuilder.append(resultSetMetaData.getColumnLabel(i)).append(",");
}
int rowLength = headersCsvBuilder.length();
headersCsvBuilder.replace(rowLength - 1, rowLength, "\n");
pipedOutputStream.write(headersCsvBuilder.toString().getBytes());
}
}
In order to test this, you need to execute http://localhost:8080/stream-demo/employees/3/cars

Spring Error while using filter and wrapper

I'm using a filter to check user rights.
A problem occurred while comparing a session value to a parameter value, so I read the request body through a re-readable wrapper.
However, the following error message came out.
List<Map<String,Object>> loginInfo = (List<Map<String,Object>>)session.getAttribute("loginSession");
if (loginInfo.get(0).get("user_type").equals("1") || loginInfo.get(0).get("user_type").equals("2"))
{
chain.doFilter(req, res);
}
else
{
RereadableRequestWrapper wrapperRequest = new RereadableRequestWrapper(request);
String requestBody= IOUtils.toString(wrapperRequest.getInputStream(), "UTF-8");
Enumeration<String> requestNames = request.getParameterNames();
if(requestBody == null) {
}
Map<String,Object> param_map = new ObjectMapper().readValue(requestBody, HashMap.class);
String userId_param = String.valueOf(param_map.get("customer_id"));
System.out.println(userId_param);
if( userId_param == null || userId_param.isEmpty()) {
logger.debug("error, customer_id error");
}
if (!loginInfo.get(0).get("customer_id").equals(userId_param))
{
logger.debug("error, customer_id error");
}
chain.doFilter(wrapperRequest, res);
}
/////////////////////////
Here is my wrapper code.
public class RereadableRequestWrapper extends HttpServletRequestWrapper {
private boolean parametersParsed = false;
private final Charset encoding;
private final byte[] rawData;
private final Map<String, ArrayList<String>> parameters = new LinkedHashMap<String, ArrayList<String>>();
ByteChunk tmpName = new ByteChunk();
ByteChunk tmpValue = new ByteChunk();
private class ByteChunk {
private byte[] buff;
private int start = 0;
private int end;
public void setByteChunk(byte[] b, int off, int len) {
buff = b;
start = off;
end = start + len;
}
public byte[] getBytes() {
return buff;
}
public int getStart() {
return start;
}
public int getEnd() {
return end;
}
public void recycle() {
buff = null;
start = 0;
end = 0;
}
}
public RereadableRequestWrapper(HttpServletRequest request) throws IOException {
super(request);
String characterEncoding = request.getCharacterEncoding();
if (StringUtils.isBlank(characterEncoding)) {
characterEncoding = StandardCharsets.UTF_8.name();
}
this.encoding = Charset.forName(characterEncoding);
// Convert InputStream data to byte array and store it to this wrapper instance.
try {
InputStream inputStream = request.getInputStream();
this.rawData = IOUtils.toByteArray(inputStream);
} catch (IOException e) {
throw e;
}
}
@Override
public ServletInputStream getInputStream() throws IOException {
final ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(this.rawData);
ServletInputStream servletInputStream = new ServletInputStream() {
public int read() throws IOException {
return byteArrayInputStream.read();
}
@Override
public boolean isFinished() {
// TODO Auto-generated method stub
return false;
}
@Override
public boolean isReady() {
// TODO Auto-generated method stub
return false;
}
@Override
public void setReadListener(ReadListener listener) {
// TODO Auto-generated method stub
}
};
return servletInputStream;
}
@Override
public BufferedReader getReader() throws IOException {
return new BufferedReader(new InputStreamReader(this.getInputStream(), this.encoding));
}
@Override
public ServletRequest getRequest() {
return super.getRequest();
}
@Override
public String getParameter(String name) {
if (!parametersParsed) {
parseParameters();
}
ArrayList<String> values = this.parameters.get(name);
if (values == null || values.size() == 0)
return null;
return values.get(0);
}
public HashMap<String, String[]> getParameters() {
if (!parametersParsed) {
parseParameters();
}
HashMap<String, String[]> map = new HashMap<String, String[]>(this.parameters.size() * 2);
for (String name : this.parameters.keySet()) {
ArrayList<String> values = this.parameters.get(name);
map.put(name, values.toArray(new String[values.size()]));
}
return map;
}
#SuppressWarnings("rawtypes")
#Override
public Map getParameterMap() {
return getParameters();
}
#SuppressWarnings("rawtypes")
#Override
public Enumeration getParameterNames() {
return new Enumeration<String>() {
#SuppressWarnings("unchecked")
private String[] arr = (String[])(getParameterMap().keySet().toArray(new String[0]));
private int index = 0;
@Override
public boolean hasMoreElements() {
return index < arr.length;
}
@Override
public String nextElement() {
return arr[index++];
}
};
}
@Override
public String[] getParameterValues(String name) {
if (!parametersParsed) {
parseParameters();
}
ArrayList<String> values = this.parameters.get(name);
String[] arr = values.toArray(new String[values.size()]);
if (arr == null) {
return null;
}
return arr;
}
private void parseParameters() {
parametersParsed = true;
if (!("application/x-www-form-urlencoded".equalsIgnoreCase(super.getContentType()))) {
return;
}
int pos = 0;
int end = this.rawData.length;
while (pos < end) {
int nameStart = pos;
int nameEnd = -1;
int valueStart = -1;
int valueEnd = -1;
boolean parsingName = true;
boolean decodeName = false;
boolean decodeValue = false;
boolean parameterComplete = false;
do {
switch (this.rawData[pos]) {
case '=':
if (parsingName) {
// Name finished. Value starts from next character
nameEnd = pos;
parsingName = false;
valueStart = ++pos;
} else {
// Equals character in value
pos++;
}
break;
case '&':
if (parsingName) {
// Name finished. No value.
nameEnd = pos;
} else {
// Value finished
valueEnd = pos;
}
parameterComplete = true;
pos++;
break;
case '%':
case '+':
// Decoding required
if (parsingName) {
decodeName = true;
} else {
decodeValue = true;
}
pos++;
break;
default:
pos++;
break;
}
} while (!parameterComplete && pos < end);
if (pos == end) {
if (nameEnd == -1) {
nameEnd = pos;
} else if (valueStart > -1 && valueEnd == -1) {
valueEnd = pos;
}
}
if (nameEnd <= nameStart) {
continue;
// ignore invalid chunk
}
tmpName.setByteChunk(this.rawData, nameStart, nameEnd - nameStart);
if (valueStart >= 0) {
tmpValue.setByteChunk(this.rawData, valueStart, valueEnd - valueStart);
} else {
tmpValue.setByteChunk(this.rawData, 0, 0);
}
try {
String name;
String value;
if (decodeName) {
name = new String(URLCodec.decodeUrl(Arrays.copyOfRange(tmpName.getBytes(), tmpName.getStart(), tmpName.getEnd())), this.encoding);
} else {
name = new String(tmpName.getBytes(), tmpName.getStart(), tmpName.getEnd() - tmpName.getStart(), this.encoding);
}
if (valueStart >= 0) {
if (decodeValue) {
value = new String(URLCodec.decodeUrl(Arrays.copyOfRange(tmpValue.getBytes(), tmpValue.getStart(), tmpValue.getEnd())), this.encoding);
} else {
value = new String(tmpValue.getBytes(), tmpValue.getStart(), tmpValue.getEnd() - tmpValue.getStart(), this.encoding);
}
} else {
value = "";
}
if (StringUtils.isNotBlank(name)) {
ArrayList<String> values = this.parameters.get(name);
if (values == null) {
values = new ArrayList<String>(1);
this.parameters.put(name, values);
}
if (StringUtils.isNotBlank(value)) {
values.add(value);
}
}
} catch (DecoderException e) {
// ignore invalid chunk
}
tmpName.recycle();
tmpValue.recycle();
}
}
}
The error message is: com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input
I don't know why this problem happened...

Queue data structure requiring K accesses before removal

I need a specialized queue-like data structure. It can be used by multiple consumers, but each item in the queue must be removed from the queue after k consumers have read it.
Is there any production-ready implementation? Or should I implement a queue with a read counter in each item and handle item removal myself?
Thanks in advance.
I think this is what you are looking for. Derived from the source code for BlockingQueue. Caveat emptor, not tested.
I tried to find a way to wrap Queue, but Queue doesn't expose its concurrency members, so you can't get the right semantics.
public class CountingQueue<E> {
private class Entry {
Entry(int count, E element) {
this.count = count;
this.element = element;
}
int count;
E element;
}
public CountingQueue(int capacity) {
if (capacity <= 0) {
throw new IllegalArgumentException();
}
this.items = new Object[capacity];
this.lock = new ReentrantLock(false);
this.condition = this.lock.newCondition();
}
private final ReentrantLock lock;
private final Condition condition;
private final Object[] items;
private int takeIndex;
private int putIndex;
private int count;
final int inc(int i) {
return (++i == items.length) ? 0 : i;
}
final int dec(int i) {
return ((i == 0) ? items.length : i) - 1;
}
private static void checkNotNull(Object v) {
if (v == null)
throw new NullPointerException();
}
/**
* Inserts element at current put position, advances, and signals.
* Call only when holding lock.
*/
private void insert(int count, E x) {
items[putIndex] = new Entry(count, x);
putIndex = inc(putIndex);
if (this.count++ == 0) { // use the queue-size field, not the per-item read-count parameter
// empty to non-empty
condition.signal();
}
}
private E extract() {
Entry entry = (Entry)items[takeIndex];
if (--entry.count <= 0) {
items[takeIndex] = null;
takeIndex = inc(takeIndex);
if (count-- == items.length) {
// full to not-full
condition.signal();
}
}
return entry.element;
}
private boolean waitNotEmpty(long timeout, TimeUnit unit) throws InterruptedException {
long nanos = unit.toNanos(timeout);
while (count == 0) {
if (nanos <= 0) {
return false;
}
nanos = this.condition.awaitNanos(nanos);
}
return true;
}
private boolean waitNotFull(long timeout, TimeUnit unit) throws InterruptedException {
long nanos = unit.toNanos(timeout);
while (count == items.length) {
if (nanos <= 0)
return false;
nanos = condition.awaitNanos(nanos);
}
return true;
}
public boolean put(int count, E e) {
checkNotNull(e);
final ReentrantLock localLock = this.lock;
localLock.lock();
try {
if (this.count == items.length) // queue is full (the size field, not the read-count parameter)
return false;
else {
insert(count, e);
return true;
}
} finally {
localLock.unlock();
}
}
public boolean put(int count, E e, long timeout, TimeUnit unit)
throws InterruptedException {
checkNotNull(e);
final ReentrantLock localLock = this.lock;
localLock.lockInterruptibly();
try {
if (!waitNotFull(timeout, unit)) {
return false;
}
insert(count, e);
return true;
} finally {
localLock.unlock();
}
}
public E get() {
final ReentrantLock localLock = this.lock;
localLock.lock();
try {
return (count == 0) ? null : extract();
} finally {
localLock.unlock();
}
}
public E get(long timeout, TimeUnit unit) throws InterruptedException {
final ReentrantLock localLock = this.lock;
localLock.lockInterruptibly();
try {
if (waitNotEmpty(timeout, unit)) {
return extract();
} else {
return null;
}
} finally {
localLock.unlock();
}
}
public int size() {
final ReentrantLock localLock = this.lock;
localLock.lock();
try {
return count;
} finally {
localLock.unlock();
}
}
public boolean isEmpty() {
final ReentrantLock localLock = this.lock;
localLock.lock();
try {
return count == 0;
} finally {
localLock.unlock();
}
}
public int remainingCapacity() {
final ReentrantLock lock= this.lock;
lock.lock();
try {
return items.length - count;
} finally {
lock.unlock();
}
}
public boolean isFull() {
final ReentrantLock localLock = this.lock;
localLock.lock();
try {
return items.length - count == 0;
} finally {
localLock.unlock();
}
}
public void clear() {
final ReentrantLock localLock = this.lock;
localLock.lock();
try {
for (int i = takeIndex, k = count; k > 0; i = inc(i), k--)
items[i] = null;
count = 0;
putIndex = 0;
takeIndex = 0;
condition.signalAll();
} finally {
localLock.unlock();
}
}
}
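A quick usage sketch (my own example, not from the original post); with a read count of 3, each element is handed out three times before it is dropped:
CountingQueue<String> queue = new CountingQueue<>(16);
queue.put(3, "task-1");   // this item must be read 3 times before removal
String a = queue.get();   // "task-1", remaining reads: 2
String b = queue.get();   // "task-1", remaining reads: 1
String c = queue.get();   // "task-1", entry removed from the queue
String d = queue.get();   // null, the queue is now empty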
A memory-efficient way that retains the info you need: each queue entry becomes a Set<ConsumerID>, so that you ensure the k reads come from k distinct consumers. Your app logic checks whether set.size() == k and removes the entry from the queue in that case.
In terms of storage, you will have trade-offs between Set implementations based on:
the size and type of the ConsumerID
the speed-of-retrieval requirement
E.g. if k is very small and your queue retrieval logic has access to a Map<ID, ConsumerId>, then you could simply use an int, or even a short or byte, depending on the number of distinct ConsumerIDs, and possibly store them in an array. This is slower than accessing a set, since it would be traversed linearly, but for small k that may be reasonable.
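A minimal sketch of that idea (class and method names are hypothetical, and the peek/remove step is not fully atomic, so treat it as a starting point rather than a finished implementation):
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class KReadQueue<E> {
    private static class Entry<E> {
        final E element;
        // distinct consumer ids that have already read this item
        final Set<String> readers = ConcurrentHashMap.newKeySet();
        Entry(E element) { this.element = element; }
    }

    private final Queue<Entry<E>> queue = new ConcurrentLinkedQueue<>();
    private final int k;

    public KReadQueue(int k) { this.k = k; }

    public void put(E element) {
        queue.add(new Entry<>(element));
    }

    // Hands the head element to the given consumer and removes it once k distinct consumers have read it.
    public E read(String consumerId) {
        Entry<E> head = queue.peek();
        if (head == null) return null;
        head.readers.add(consumerId);
        if (head.readers.size() >= k) {
            queue.remove(head);
        }
        return head.element;
    }
}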

Reduce doesn't run but job is successfully completed

Firstly, I am a newbie at Hadoop MapReduce. My reducer does not run, but the job shows that it completed successfully. Below is my console output:
INFO mapreduce.Job: Running job: job_1418240815217_0015
INFO mapreduce.Job: Job job_1418240815217_0015 running in uber mode : false
INFO mapreduce.Job: map 0% reduce 0%
INFO mapreduce.Job: map 100% reduce 0%
INFO mapreduce.Job: Job job_1418240815217_0015 completed successfully
INFO mapreduce.Job: Counters: 30
The main class is :
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
#SuppressWarnings("deprecation")
Job job = new Job(conf,"NPhase2");
job.setJarByClass(NPhase2.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(NPhase2Value.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
int numberOfPartition = 0;
List<String> other_args = new ArrayList<String>();
for(int i = 0; i < args.length; ++i)
{
try {
if ("-m".equals(args[i])) {
//conf.setNumMapTasks(Integer.parseInt(args[++i]));
++i;
} else if ("-r".equals(args[i])) {
job.setNumReduceTasks(Integer.parseInt(args[++i]));
} else if ("-k".equals(args[i])) {
int knn = Integer.parseInt(args[++i]);
conf.setInt("knn", knn);
System.out.println(knn);
} else {
other_args.add(args[i]);
}
job.setNumReduceTasks(numberOfPartition * numberOfPartition);
//conf.setNumReduceTasks(1);
} catch (NumberFormatException except) {
System.out.println("ERROR: Integer expected instead of " + args[i]);
} catch (ArrayIndexOutOfBoundsException except) {
System.out.println("ERROR: Required parameter missing from " + args[i-1]);
}
}
// Make sure there are exactly 2 parameters left.
if (other_args.size() != 2) {
System.out.println("ERROR: Wrong number of parameters: " +
other_args.size() + " instead of 2.");
}
FileInputFormat.setInputPaths(job, other_args.get(0));
FileOutputFormat.setOutputPath(job, new Path(other_args.get(1)));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
My mapper is :
public static class MapClass extends Mapper<LongWritable, Text, IntWritable, NPhase2Value>
{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
String[] parts = line.split("\\s+");
// key format <rid1>
IntWritable mapKey = new IntWritable(Integer.valueOf(parts[0]));
// value format <rid2, dist>
NPhase2Value np2v = new NPhase2Value(Integer.valueOf(parts[1]), Float.valueOf(parts[2]));
context.write(mapKey, np2v);
}
}
My reducer class is :
public static class Reduce extends Reducer<IntWritable, NPhase2Value, NullWritable, Text>
{
int numberOfPartition;
int knn;
class Record
{
public int id2;
public float dist;
Record(int id2, float dist)
{
this.id2 = id2;
this.dist = dist;
}
public String toString()
{
return Integer.toString(id2) + " " + Float.toString(dist);
}
}
class RecordComparator implements Comparator<Record>
{
public int compare(Record o1, Record o2)
{
int ret = 0;
float dist = o1.dist - o2.dist;
if (Math.abs(dist) < 1E-6)
ret = o1.id2 - o2.id2;
else if (dist > 0)
ret = 1;
else
ret = -1;
return -ret;
}
}
public void setup(Context context)
{
Configuration conf = new Configuration();
conf = context.getConfiguration();
numberOfPartition = conf.getInt("numberOfPartition", 2);
knn = conf.getInt("knn", 3);
}
public void reduce(IntWritable key, Iterator<NPhase2Value> values, Context context) throws IOException, InterruptedException
{
//initialize the pq
RecordComparator rc = new RecordComparator();
PriorityQueue<Record> pq = new PriorityQueue<Record>(knn + 1, rc);
// For each record we have a reduce task
// value format <rid1, rid2, dist>
while (values.hasNext())
{
NPhase2Value np2v = values.next();
int id2 = np2v.getFirst().get();
float dist = np2v.getSecond().get();
Record record = new Record(id2, dist);
pq.add(record);
if (pq.size() > knn)
pq.poll();
}
while(pq.size() > 0)
{
context.write(NullWritable.get(), new Text(key.toString() + " " + pq.poll().toString()));
//break; // only ouput the first record
}
} // reduce
}
This is my helper class :
public class NPhase2Value implements WritableComparable<NPhase2Value> {
private IntWritable first;
private FloatWritable second;
public NPhase2Value() {
set(new IntWritable(), new FloatWritable());
}
public NPhase2Value(int first, float second) {
set(new IntWritable(first), new FloatWritable(second));
}
public void set(IntWritable first, FloatWritable second) {
this.first = first;
this.second = second;
}
public IntWritable getFirst() {
return first;
}
public FloatWritable getSecond() {
return second;
}
@Override
public void write(DataOutput out) throws IOException {
first.write(out);
second.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
}
@Override
public boolean equals(Object o) {
if (o instanceof NPhase2Value) {
NPhase2Value np2v = (NPhase2Value) o;
return first.equals(np2v.first) && second.equals(np2v.second);
}
return false;
}
@Override
public String toString() {
return first.toString() + " " + second.toString();
}
@Override
public int compareTo(NPhase2Value np2v) {
return 1;
}
}
The command line command I use is :
hadoop jar knn.jar NPhase2 -m 1 -r 3 -k 4 phase1out phase2out
I am trying hard to figure out the error but am still not able to come up with a solution. Please help me in this regard, as I am running on a tight schedule.
Because you have set the number of reduce tasks to 0. See this:
int numberOfPartition = 0;
//.......
job.setNumReduceTasks(numberOfPartition * numberOfPartition);
I don't see numberOfPartition being reset anywhere in your code. I think you should either set it where you parse the -r option, or remove the call to setNumReduceTasks shown above entirely, since you already set the reducer count while parsing -r.
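A minimal sketch of the fix, just rearranging the argument-parsing loop from the question so the stray call no longer resets the reducer count (this is one option, not the only one):
for (int i = 0; i < args.length; ++i) {
    try {
        if ("-m".equals(args[i])) {
            ++i; // the number of map tasks is ignored here
        } else if ("-r".equals(args[i])) {
            job.setNumReduceTasks(Integer.parseInt(args[++i]));
        } else if ("-k".equals(args[i])) {
            conf.setInt("knn", Integer.parseInt(args[++i]));
        } else {
            other_args.add(args[i]);
        }
        // no job.setNumReduceTasks(numberOfPartition * numberOfPartition) here:
        // numberOfPartition is never set, so that call forced the reducer count to 0
    } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of " + args[i]);
    } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from " + args[i-1]);
    }
}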

Mybatis Custom Type handler: call FileInputStream.close() after query being executed

I am trying to implement a MyBatis custom type handler for File using FileInputStream.
Here is my code for setting the parameter:
@MappedJdbcTypes(JdbcType.LONGVARBINARY)
public class FileByteaHandler extends BaseTypeHandler<File> {
@Override
public void setNonNullParameter(PreparedStatement ps, int i, File file, JdbcType jdbcType) throws SQLException{
try {
FileInputStream fis = new FileInputStream(file);
ps.setBinaryStream(1, fis, (int) file.length());
} catch(FileNotFoundException ex) {
Logger.getLogger(FileByteaHandler.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
My question is:
I cannot close this FileInputStream at the end of this method, otherwise MyBatis will not be able to read the data from it. In fact, I do not know where I can close the FileInputStream. Is there a way to call close() after the query has been executed in MyBatis?
Thanks in advance,
UPDATE
Thanks to Jarandinor for the help. Here is my code for this type handler; hopefully it can help someone:
@MappedJdbcTypes(JdbcType.LONGVARBINARY)
public class FileByteaHandler extends BaseTypeHandler<File> {
@Override
public void setNonNullParameter(PreparedStatement ps, int i, File file, JdbcType jdbcType) throws SQLException {
try {
AutoCloseFileInputStream fis = new AutoCloseFileInputStream(file);
ps.setBinaryStream(1, fis, (int) file.length());
} catch(FileNotFoundException ex) {
Logger.getLogger(FileByteaHandler.class.getName()).log(Level.SEVERE, null, ex);
}
}
@Override
public File getNullableResult(ResultSet rs, String columnName) throws SQLException {
File file = null;
try(InputStream input = rs.getBinaryStream(columnName)) {
file = getResult(rs, input);
} catch(IOException e) {
System.out.println(e.getMessage());
}
return file;
}
public File createFile() {
File file = new File("e:/target-file"); //your temp file path
return file;
}
private File getResult(ResultSet rs, InputStream input) throws SQLException {
File file = createFile();
try(OutputStream output = new FileOutputStream(file)) {
int bufSize = 0x8000000;
byte buf[] = new byte[bufSize];
int s = 0;
int tl = 0;
while( (s = input.read(buf, 0, bufSize)) > 0 ) {
output.write(buf, 0, s);
tl += s;
}
output.flush();
} catch(IOException e) {
System.out.println(e.getMessage());
}
return file;
}
@Override
public File getNullableResult(ResultSet rs, int columnIndex) throws SQLException {
File file = null;
try(InputStream input = rs.getBinaryStream(columnIndex)) {
file = getResult(rs, input);
} catch(IOException e) {
System.out.println(e.getMessage());
}
return file;
}
@Override
public File getNullableResult(CallableStatement cs, int columnIndex) throws SQLException {
throw new SQLException("getNullableResult(CallableStatement cs, int columnIndex) is called");
}
private class AutoCloseFileInputStream extends FileInputStream {
public AutoCloseFileInputStream(File file) throws FileNotFoundException {
super(file);
}
@Override
public int read() throws IOException {
int c = super.read();
if(available() <= 0) {
close();
}
return c;
}
public int read(byte[] b) throws IOException {
int c = super.read(b);
if(available() <= 0) {
close();
}
return c;
}
public int read(byte[] b, int off, int len) throws IOException {
int c = super.read(b, off, len);
if(available() <= 0) {
close();
}
return c;
}
}
}
I don't know a good way to close the stream after query execution.
Method 1:
read the file into a byte[]
(note: in JDK 7 you can use Files.readAllBytes(Paths.get(file.getPath())))
and use:
ps.setBytes(i, bytes);
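A rough sketch of Method 1 (assuming the file comfortably fits in memory; the exact exception handling is up to you):
@Override
public void setNonNullParameter(PreparedStatement ps, int i, File file, JdbcType jdbcType) throws SQLException {
    try {
        byte[] bytes = Files.readAllBytes(file.toPath()); // JDK 7+
        ps.setBytes(i, bytes);
    } catch (IOException ex) {
        throw new SQLException("Could not read file " + file, ex);
    }
}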
Method 2: create your own class inherited from FileInputStream and override its public int read() throws IOException method; when the end of the file is reached, close the stream:
@Override
public int read() throws IOException {
int c = super.read();
if(c == -1) {
super.close();
}
return c;
}
Maybe you should also override public int read(byte[] b) throws IOException;
it depends on the JDBC implementation.
Method 3: you can change your FileByteaHandler:
1) add a List<FileInputStream> field;
2) put the opened InputStream into that list in setNonNullParameter;
3) add a closeStreams() method, where you close and remove all InputStreams from the list.
Then invoke this method after you have invoked your mapper method: session.getConfiguration().getTypeHandlerRegistry().getMappingTypeHandler(FileByteaHandler.class).closeStreams();
Or use the MyBatis plugin system to run the above call.
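A sketch of that bookkeeping (field and method names are my own, and the getNullableResult overrides from the UPDATE above are omitted for brevity):
@MappedJdbcTypes(JdbcType.LONGVARBINARY)
public class FileByteaHandler extends BaseTypeHandler<File> {
    private final List<InputStream> openStreams = Collections.synchronizedList(new ArrayList<InputStream>());

    @Override
    public void setNonNullParameter(PreparedStatement ps, int i, File file, JdbcType jdbcType) throws SQLException {
        try {
            FileInputStream fis = new FileInputStream(file);
            openStreams.add(fis); // remember the stream so it can be closed after the query runs
            ps.setBinaryStream(i, fis, (int) file.length());
        } catch (FileNotFoundException ex) {
            throw new SQLException(ex);
        }
    }

    // Call this after the mapper method has been executed.
    public void closeStreams() {
        synchronized (openStreams) {
            for (InputStream in : openStreams) {
                try { in.close(); } catch (IOException ignored) { }
            }
            openStreams.clear();
        }
    }

    // getNullableResult(...) overrides omitted for brevity.
}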
