I have one file that has data like:
11/16/2015,"others (phone,health,etc.)",cont'd attempts,"resource,inc.",dg
I want to remove the commas present only inside the double quotes.
Expected Result
11/16/2015,"others (phone health etc.)",cont'd attempts,"resource inc.",dg
So far, this is what I tried:
Foreach a generate replace ($1,',','');
Foreach a generate regex_extract($1,'[\,]+',1);
But neither of them works.
First of all, use a regular expression to separate the fields in the tuple, and then apply REPLACE.
Try this code:
a = load '<path>' as (line:chararray);
b = foreach a generate FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,]["](.*)["][,](.*)[,]["](.*)["][,](.*)'));
c = foreach b generate $0,REPLACE($1,',',' '),$2,REPLACE($3,',',' '),$4;
dump c;
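If a build of piggybank is available, another option is to load the file with org.apache.pig.piggybank.storage.CSVExcelStorage, which understands quoted fields and therefore deals with the embedded commas at load time.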
This can also be achieved using a UDF, which can look at all the fields in each tuple passed to it:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class CommaRemove extends EvalFunc<Tuple> {

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            int inputSize = input.size();
            Tuple output = TupleFactory.getInstance().newTuple(inputSize);
            for (int i = 0; i < inputSize; i++) {
                Object field = input.get(i);
                // Strip the commas from the string form of each field; pass nulls through unchanged
                output.set(i, field == null ? null : field.toString().replace(",", ""));
            }
            return output;
        } catch (Exception e) {
            System.err.println("Failed to process input; error - " + e.getMessage());
            return null;
        }
    }
}
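Once compiled into a jar, the UDF would be registered and applied from Pig along the lines of REGISTER commaremove.jar; followed by b = foreach a generate CommaRemove(*); (the jar name here is illustrative).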
I need to parse XML that contains incrementing namespace numbers and multiple namespaces; the XML comes from a service which can't be updated. The original approach was to simply unmarshal it to Java objects and be on our way, but the provider uses an older tool (Castor) to create the message, and we have no access to it. The plan is to parse it and then marshal/unmarshal it.
<TESTXmlResponse xmlns="TEST/TESTXmlResponse">
  <firstRequest>
    <ns1:xmlRequest xmlns:ns1="TEST/XMLRequest">
      <ns2:username xmlns:ns2="TEST/XMLUserNameRequest">
        <ns3:value xmlns:ns3="TEST/XMLValueRequest">test</ns3:value>
      </ns2:username>
    </ns1:xmlRequest>
  </firstRequest>
  <data>
    <ns4:name xmlns:ns4="TEST/XMLConstants">name1</ns4:name>
    <ns5:value xmlns:ns5="TEST/XMLConstants">data1</ns5:value>
  </data>
  <data>
    <ns6:name xmlns:ns6="TEST/XMLConstants">name2</ns6:name>
    <ns7:value xmlns:ns7="TEST/XMLConstants">data2</ns7:value>
  </data>
  <data>
    <ns8:name xmlns:ns8="TEST/XMLConstants">name3</ns8:name>
    <ns9:value xmlns:ns9="TEST/XMLConstants">data3</ns9:value>
  </data>
</TESTXmlResponse>
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class Main
{
    public static void main(String[] args) throws Exception
    {
        ArrayList<String> constants = new ArrayList<String>();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new FileInputStream(new File("constants.xml")));
        // Get XPath expression
        XPathFactory xpathfactory = XPathFactory.newInstance();
        XPath xpath = xpathfactory.newXPath();
        xpath.setNamespaceContext(new NamespaceResolver(doc));
        XPathExpression expr =
            xpath.compile("//firstRequest/ns1:xmlRequest/ns2:username/ns3:value/text()");
        Object result = expr.evaluate(doc, XPathConstants.NODESET);
        NodeList nodes = (NodeList) result;
        for (int i = 0; i < nodes.getLength(); i++) {
            constants.add(nodes.item(i).getNodeValue());
        }
        if (constants.size() > 0) {
            System.out.println(constants);
        }
    }
}

class NamespaceResolver implements NamespaceContext
{
    private Document sourceDocument;

    public NamespaceResolver(Document document) {
        sourceDocument = document;
    }

    public String getNamespaceURI(String prefix) {
        if (prefix.equals(XMLConstants.DEFAULT_NS_PREFIX)) {
            return sourceDocument.lookupNamespaceURI(null);
        } else {
            return sourceDocument.lookupNamespaceURI(prefix);
        }
    }

    public String getPrefix(String namespaceURI) {
        return sourceDocument.lookupPrefix(namespaceURI);
    }

    @SuppressWarnings("rawtypes")
    public Iterator getPrefixes(String namespaceURI) {
        return null;
    }
}
The NodeList returned from the expr is where the issue lies: it isn't null, but its length is zero.
I've looked at several examples, and this one seems to be the closest to a solution. The XPathExpression expr appears to be the issue; refining it for each case would seem to be a reasonable approach.
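One detail worth noting: in XPath 1.0, an unprefixed step such as firstRequest matches only elements in no namespace, but firstRequest here is in the default namespace TEST/TESTXmlResponse, which by itself makes the expression return an empty node list. A sketch of a workaround (untested against the real service response) is to match on local names, which also sidesteps the incrementing ns1, ns2, ... prefixes and needs no NamespaceContext at all:

XPathExpression expr = xpath.compile(
        // Match elements by local name only, ignoring namespaces and generated prefixes
        "//*[local-name()='firstRequest']"
        + "/*[local-name()='xmlRequest']"
        + "/*[local-name()='username']"
        + "/*[local-name()='value']/text()");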
I have populated an HBase table with a row ID and various information pertaining to tweets, such as clean-text, url, hashtag, etc., as follows:
902221655086211073 column=clean-tweet:clean-text-cta, timestamp=1514793745304, value=democrat mayor order hurricane harvey stand houston
However, while populating it I noticed that some of the rows are empty, like:
902487280543305728 column=clean-tweet:clean-text-cta, timestamp=1514622371008, value=
Now how do I find the count of rows that have data?
Please help me with this.
There is no provision to do this in the HBase shell as of now. Maybe you can use simple code like the following to get the number of records with no value for the provided column qualifier.
Usage: CountAndFilter [tableName] [columnFamily] [columnQualifier]
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CountAndFilter {

    private static Connection conn;
    private static int recordsWithoutValue = 0;

    // Lazily create the shared connection
    public static void getConnection() throws IOException {
        if (conn == null) {
            conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
        }
    }

    public static void main(String args[]) throws IOException {
        getConnection();
        scan(args[0], args[1], args[2]);
        System.out.println("Records with empty value : " + recordsWithoutValue);
    }

    public static void scan(String tableName, String columnFamily, String columnQualifier) throws IOException {
        Table table = conn.getTable(TableName.valueOf(tableName));
        ResultScanner rs = table.getScanner(new Scan().addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes(columnQualifier)));
        Result res = null;
        try {
            while ((res = rs.next()) != null) {
                // Count rows whose cell for this column exists but holds an empty value
                if (res.containsEmptyColumn(Bytes.toBytes(columnFamily), Bytes.toBytes(columnQualifier))) {
                    recordsWithoutValue++;
                }
            }
        } finally {
            rs.close();
        }
    }
}
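To count the rows that do have data directly, one option is to push the emptiness test to the region servers with a ValueFilter; this is a sketch against the scan method above, assuming the HBase 1.x client filter API:

import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.ValueFilter;

// Scan the same column, but let the server drop the empty values
Scan scan = new Scan().addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes(columnQualifier));
scan.setFilter(new ValueFilter(CompareFilter.CompareOp.NOT_EQUAL, new BinaryComparator(new byte[0])));

int recordsWithValue = 0;
ResultScanner scanner = table.getScanner(scan);
try {
    while (scanner.next() != null) {
        recordsWithValue++;
    }
} finally {
    scanner.close();
}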
I'm new to HtmlUnit (2.23) and I can't get this test to work.
I'm getting this ClassCastException thrown out of HtmlUnit, and I don't know if it is a bug or if I am doing something wrong.
java.lang.ClassCastException: com.gargoylesoftware.htmlunit.TextPage cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
at com.gargoylesoftware.htmlunit.WebClient.makeWebResponseForJavaScriptUrl(WebClient.java:1241)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:375)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:304)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:451)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:436)
at org.wyttenbach.dale.mlec.OutageTest.test(OutageTest.java:46)
...
The code:
import java.awt.Desktop;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import org.junit.Assert;
import org.junit.Test;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.JavaScriptPage;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.TextPage;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class OutageTest {

    private static final String SITE_URL = "https://ebill.mlecmn.net/woViewer/";
    private static final String OUTAGE_MAP_URL = SITE_URL + "mapviewer.html?config=Outage+Web+Map";

    @Test
    public void test() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        try (final WebClient webClient = new WebClient()) {
            webClient.waitForBackgroundJavaScript(20000);
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());
            webClient.getOptions().setUseInsecureSSL(true);
            Map<String, Page> urls = new HashMap<String, Page>();
            LinkedList<String> urlsToVisit = new LinkedList<String>();
            urlsToVisit.add(OUTAGE_MAP_URL);
            while (!urlsToVisit.isEmpty()) {
                String url = urlsToVisit.remove();
                if (urls.containsKey(url)) {
                    continue;
                }
                Page page = webClient.getPage(url);
                urls.put(url, page);
                if (page instanceof HtmlPage) {
                    HtmlPage page2 = (HtmlPage) page;
                    System.err.println("================================================================");
                    System.err.println(page2.asXml());
                    System.err.println("================================================================");
                    Assert.assertFalse("Outage in Nordland township: " + url, page2.asText().contains("Nordland"));
                    urlsToVisit.addAll(extractLinks(page2));
                } else if (page instanceof JavaScriptPage) {
                    JavaScriptPage page2 = (JavaScriptPage) page;
                    Assert.assertFalse("Outage in Nordland township: " + url, page2.getContent().contains("Nordland"));
                } else if (page instanceof TextPage) {
                    TextPage page2 = (TextPage) page;
                    Assert.assertFalse("Outage in Nordland township: " + url, page2.getContent().contains("Nordland"));
                } else {
                    System.err.println(String.format("%s => %s", url, page.getClass().getName()));
                }
            }
        } catch (AssertionError e) {
            reportOutage();
            throw e;
        }
    }

    private Collection<String> extractLinks(HtmlPage page) {
        List<String> links = new ArrayList<String>();
        for (DomElement x : page.getElementsByTagName("script")) {
            String src = x.getAttribute("src");
            if (!src.contains(":")) {
                src = SITE_URL + src;
                System.err.println("script src=" + src);
            }
            links.add(src);
        }
        for (DomElement x : page.getElementsByTagName("link")) {
            String href = x.getAttribute("href");
            if (!href.contains(":")) {
                href = SITE_URL + href;
                System.err.println("link href=" + href);
            }
            links.add(href);
        }
        // Causes ClassCastException: com.gargoylesoftware.htmlunit.TextPage cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
        // at com.gargoylesoftware.htmlunit.WebClient.makeWebResponseForJavaScriptUrl(WebClient.java:1241)
        for (DomElement x : page.getElementsByTagName("iframe")) {
            String src = x.getAttribute("src");
            if (!src.contains(":")) {
                src = SITE_URL + src;
                System.err.println("iframe src=" + src);
            }
            links.add(src);
        }
        return links;
    }

    private void reportOutage() {
        try {
            Desktop.getDesktop().browse(new URI(OUTAGE_MAP_URL));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
More or less, yes, but I would have to do a deeper analysis.
There is some hope for you, though. ;-)
Your code tries to extract URLs from a given web page. During the process you are adding the URL 'javascript:""' to your list of URLs to be processed. This URL results in the class cast exception. If you do not add this URL to the list, the test works (at least for me).
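For example, a minimal guard in extractLinks (a sketch against the test code above) would skip those pseudo-URLs before they are queued:

for (DomElement x : page.getElementsByTagName("iframe")) {
    String src = x.getAttribute("src");
    // javascript: pseudo-URLs are what WebClient.getPage() chokes on, so skip them
    if (src.startsWith("javascript:")) {
        continue;
    }
    if (!src.contains(":")) {
        src = SITE_URL + src;
    }
    links.add(src);
}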
I am trying to update thousands of rows in a table using batchUpdate. My requirements are:
1) Assume there are 1000 records in a batch and record no. 235 causes an error. How do I find out which record caused the error?
2) Assume that record 600 did not result in an update (the reason could be that no record matched the where clause). How can I find out which records did not result in an update?
3) In both scenarios above, how can I continue processing the remaining records?
After a long search and much debugging, the only solution I found is to catch the BatchUpdateException, look for the negative elements in its update counts, and deduce from the map which insert is in error.
import java.sql.BatchUpdateException;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.stereotype.Repository;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

@Transactional(propagation = Propagation.REQUIRED, rollbackFor = Exception.class)
@Repository("dao_")
public class YouDao extends CommunDao implements IyouDao {

    public void bulkInsert(final List<Map<String, String>> map)
            throws BusinessException {
        try {
            String sql = "insert into your_table (aa, bb) values (?, ?)";
            BatchPreparedStatementSetter batchPreparedStatementSetter = new BatchPreparedStatementSetter() {

                @Override
                public void setValues(PreparedStatement ps, int i)
                        throws SQLException {
                    Map<String, String> bean = map.get(i);
                    ps.setString(1, bean.get("aa"));
                    ps.setString(2, bean.get("bb"));
                    //..
                    //..
                }

                @Override
                public int getBatchSize() {
                    return map.size();
                }
            };
            getJdbcTemplate().batchUpdate(sql, batchPreparedStatementSetter);
        } catch (Exception e) {
            if (e.getCause() instanceof BatchUpdateException) {
                BatchUpdateException be = (BatchUpdateException) e.getCause();
                int[] batchRes = be.getUpdateCounts();
                if (batchRes != null && batchRes.length > 0) {
                    for (int index = 0; index < batchRes.length; index++) {
                        // Statement.EXECUTE_FAILED marks the statements that errored
                        if (batchRes[index] == Statement.EXECUTE_FAILED) {
                            logger.error("Error execution >>>>>>>>>>>"
                                    + index + " --- , codeFail : " + batchRes[index]
                                    + "---, line " + map.get(index));
                        }
                    }
                }
            }
            throw new BusinessException(e);
        }
    }
}
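Note that whether the rest of the batch keeps executing after a failure is driver-dependent under the JDBC specification: some drivers stop at the first failure (getUpdateCounts() then returns fewer entries than the batch size), while others continue and mark each failed statement with Statement.EXECUTE_FAILED, which is what the check above relies on.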
Alternatively, batchUpdate returns an int[] of per-statement update counts, so a zero entry identifies a statement whose where clause matched nothing:

int[] rows = jdbcTemplate.batchUpdate(TbCareQueryConstant.SQL_UPDATE_BANKDETAILS_OF_USER, new BatchPreparedStatementSetter() {
    // ... your setValues / getBatchSize implementation here ...
});
for (int i = 0; i < rows.length; i++) {
    if (rows[i] == 0) {
        // statement i updated no rows, e.g. nothing matched the where clause
    }
}
(How) Can I use Bigram Features with the OpenNLP Document Classifier?
I have a collection of very short documents (titles, phrases, and sentences), and I would like to add bigram features of the kind used in the tool LibShortText:
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/
Is this possible? The documentation only explains how to do this for the Name Finder, using BigramNameFeatureGenerator(), and not for the Document Classifier.
I believe the trainer and classifier allow custom feature generators in their methods; however, they must be implementations of FeatureGenerator, and BigramNameFeatureGenerator is not an implementation of that. So I made a quick implementation as an inner class below. Try this (untested) code when you get a chance:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.doccat.FeatureGenerator;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class DoccatUsingBigram {

    public static void main(String[] args) throws IOException {
        InputStream dataIn = new FileInputStream(args[0]);
        try {
            ObjectStream<String> lineStream =
                    new PlainTextByLineStream(dataIn, "UTF-8");
            // here you can use it as part of building the model
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
            DoccatModel model = DocumentCategorizerME.train("en", sampleStream, 10, 100, new MyBigramFeatureGenerator());

            // now you would use it like this
            DocumentCategorizerME classifier = new DocumentCategorizerME(model);
            String[] someData = "whatever you are trying to classify".split(" ");
            Collection<String> bigrams = new MyBigramFeatureGenerator().extractFeatures(someData);
            double[] categorize = classifier.categorize(bigrams.toArray(new String[bigrams.size()]));
        } catch (IOException e) {
            // Failed to read or parse training data, training failed
            e.printStackTrace();
        }
    }

    public static class MyBigramFeatureGenerator implements FeatureGenerator {

        @Override
        public Collection<String> extractFeatures(String[] text) {
            return generate(Arrays.asList(text), 2, "");
        }

        private List<String> generate(List<String> input, int n, String separator) {
            List<String> outGrams = new ArrayList<String>();
            // Slide a window of n tokens across the input and join each window into one feature
            for (int i = 0; i + n <= input.size(); i++) {
                String gram = "";
                for (int x = i; x < (n + i); x++) {
                    gram += input.get(x) + separator;
                }
                gram = gram.substring(0, gram.lastIndexOf(separator));
                outGrams.add(gram);
            }
            return outGrams;
        }
    }
}
Hope this helps...
You can use the NGramFeatureGenerator class in OpenNLP [1] for your use case.
[1] https://github.com/apache/opennlp
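For example, a sketch along these lines (sampleStream as in the code above; the class names assume an OpenNLP 1.8.x release, where DoccatFactory accepts a FeatureGenerator[], so check them against your version):

import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.FeatureGenerator;
import opennlp.tools.doccat.NGramFeatureGenerator;
import opennlp.tools.util.TrainingParameters;

// Combine unigram (bag-of-words) and bigram features for the document categorizer
FeatureGenerator[] featureGenerators = new FeatureGenerator[] {
        new BagOfWordsFeatureGenerator(),
        new NGramFeatureGenerator(2, 2) // bigrams only
};
DoccatFactory factory = new DoccatFactory(featureGenerators);
DoccatModel model = DocumentCategorizerME.train("en", sampleStream,
        TrainingParameters.defaultParams(), factory);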
Thanks,
Madhawa