JSoup, Extract specific text or image-link from website

This is in my HTML and I need to extract either the link of the image or the fileId to create a link.
{
"created": "2018-11-06T06:46:21.181Z",
"inRiverId": "58515",
"mediaInformation": {
"bildid": 67708,
"description": "ABC",
"excelImportField": "ABC",
"fileId": "41964",
"filename": "ABC",
"imageAccess": true,
"imageItemType": "ABC",
"imageStatus": "ABC",
"imageType": "ABC",
"itcl": "ABC",
"photographer": "ABC",
"projectName": "ABC",
"projectType": "Webb",
"type": "Bild",
"url": "https://static.john.com/images/products/41964.jpg"
},
The fileId is dynamic, so the link is also dynamic, depending on which article I'm visiting on my website. The link always starts with "https://static.john.com/images/products/", followed by the fileId and ".jpeg".
The code above is of course just a piece of the code on the website. There are more images on the page, so this one needs to be extracted specifically.
The dream would be if Jsoup could get this output:
"https://static.john.com/images/products/" + "fileId" + ".jpeg"
I'm a beginner in Android Studio, and crazy-new to JSoup.

Bad news: Jsoup can't do this directly. Jsoup parses only HTML, but this is a JSON fragment that JavaScript uses after the page has loaded in the browser.
However, you can get this URL with a regular expression. Try this code:
String json = "{\n" +
" \"created\": \"2018-11-06T06:46:21.181Z\",\n" +
" \"inRiverId\": \"58515\",\n" +
" \"mediaInformation\": {\n" +
" \"bildid\": 67708,\n" +
" \"description\": \"ABC\",\n" +
" \"excelImportField\": \"ABC\",\n" +
" \"fileId\": \"41964\",\n" +
" \"filename\": \"ABC\",\n" +
" \"imageAccess\": true,\n" +
" \"imageItemType\": \"ABC\",\n" +
" \"imageStatus\": \"ABC\",\n" +
" \"imageType\": \"ABC\",\n" +
" \"itcl\": \"ABC\",\n" +
" \"photographer\": \"ABC\",\n" +
" \"projectName\": \"ABC\",\n" +
" \"projectType\": \"Webb\",\n" +
" \"type\": \"Bild\",\n" +
" \"url\": \"https://static.john.com/images/products/41964.jpg\"\n" +
" },";
Pattern p = Pattern.compile("\"url\":\\s*\"(https://static\\.john\\.com/images/products/\\d+\\.jpe?g)\"");
Matcher m = p.matcher(json);
if (m.find()) {
String imageUrl = m.group(1);
System.out.println(imageUrl);
}
and the output is: https://static.john.com/images/products/41964.jpg
In your case, use document.html() instead of the json variable.
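Since the question asks for a link assembled from the fileId, a variation on the same regular-expression idea can be sketched as follows. This is a minimal sketch, not part of the original answer: the class name ImageLinkBuilder, the method imageLink, and the ".jpg" extension (taken from the sample fragment rather than the ".jpeg" mentioned in the question) are assumptions. It uses the first fileId found in the text.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: builds the image link from the first "fileId" in a text.
public class ImageLinkBuilder {
    // Matches "fileId": "41964" and captures the digits.
    private static final Pattern FILE_ID =
            Pattern.compile("\"fileId\"\\s*:\\s*\"(\\d+)\"");

    // Base URL as stated in the question.
    private static final String BASE = "https://static.john.com/images/products/";

    // Returns the assembled link, or empty if no fileId is present.
    static Optional<String> imageLink(String body) {
        Matcher m = FILE_ID.matcher(body);
        if (m.find()) {
            // Assumption: ".jpg", as in the sample "url" value.
            return Optional.of(BASE + m.group(1) + ".jpg");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String sample = "{ \"mediaInformation\": { \"fileId\": \"41964\" } }";
        System.out.println(imageLink(sample).orElse("no fileId found"));
    }
}
```

With Jsoup, the text returned by document.html() would be passed as the body argument.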

MainActivity.java
public class MainActivity extends AppCompatActivity {
public class JsoupGetImage {
public void main( String[] args ) throws IOException {
// Get the article number from TextView to complete link
String articlenumber = ((TextView) recyclerView.findViewHolderForAdapterPosition(0).itemView.findViewById(R.id.holderArticle)).getText().toString();
// My URL + article number
String url = "https://www.john.com/api/v1.0/articles/" + articlenumber;
// JSoup
Document doc = Jsoup.connect(url).get();
// Pattern - Find pattern using partial link
Pattern p = Pattern.compile("\"url\": \"(https://static.john.com/images/products/\\d+.jp[e]*g)\"");
// Matcher - Find match in my URL from Pattern
Matcher m = p.matcher(url);
if (m.find()) {
String imageUrl = m.group(1);
System.out.println(imageUrl);
}
}
}
}
Does this look good, or am I missing something? How do I get my ImageView to show the image from the created link?
Kind regards,
John

Related

Spring Data Elastic Search with dynamic document replicas and shards

I'm using Spring Boot 2.4 and spring-data-elasticsearch 4.1. I have a document like this:
@Document(indexName = "test", replicas = 3, shards = 2)
public class TestDocument {
@Id
@Field(type = FieldType.Keyword)
private String Id;
@Field(type = FieldType.Object, enabled = false)
private String name;
...
getters
setters
}
I want to override the hardcoded replicas and shards values in the Document annotation from application.yml at index creation, because these values can differ between the environments of my service. Is there any way to do this?

You can disable auto-creation of an index by using
@Document(indexName = "test", createIndex = false)
and then create the index by using the IndexOperations.create(settings). The following code is taken from Spring Data Elasticsearch tests (https://github.com/spring-projects/spring-data-elasticsearch/blob/4.1.x/src/test/java/org/springframework/data/elasticsearch/core/ElasticsearchTemplateTests.java#L2361-L2384):
public void shouldCreateIndexWithGivenClassAndSettings() {
// given
String settings = "{\n" + " \"index\": {\n" + " \"number_of_shards\": \"1\",\n"
+ " \"number_of_replicas\": \"0\",\n" + " \"analysis\": {\n"
+ " \"analyzer\": {\n" + " \"emailAnalyzer\": {\n"
+ " \"type\": \"custom\",\n"
+ " \"tokenizer\": \"uax_url_email\"\n" + " }\n"
+ " }\n" + " }\n" + " }\n" + '}';
// when
indexOperations.delete();
indexOperations.create(parse(settings));
indexOperations.putMapping(SampleEntity.class);
indexOperations.refresh();
// then
Map<String, Object> map = indexOperations.getSettings();
assertThat(operations.indexOps(IndexCoordinates.of(INDEX_NAME_SAMPLE_ENTITY)).exists()).isTrue();
assertThat(map.containsKey("index.number_of_replicas")).isTrue();
assertThat(map.containsKey("index.number_of_shards")).isTrue();
assertThat((String) map.get("index.number_of_replicas")).isEqualTo("0");
assertThat((String) map.get("index.number_of_shards")).isEqualTo("1");
}
Instead of parsing the JSON for the settings you should create a Document in your code:
Document settings = Document.create();
settings.put("index.number_of_replicas", 42);
settings.put("index.number_of_shards", 42);

You can add a @Setting annotation to your entity. It is provided by spring-data-elasticsearch to let you customize the index settings.
https://docs.spring.io/spring-data/elasticsearch/docs/current/reference/html/#elasticsearc.misc.index.settings

Spring Data MongoDb - Criteria equivalent to a given query that uses $expr

I have a collection with documents like this:
{
"_id" : ObjectId("5a8ec4620cd3c2a4062548ec"),
"start" : 20,
"end" : 80
}
and I want to show the documents that overlap a given interval (startInterval = 10, endInterval = 90) by more than a given percentage (50%).
I calculate the overlapping section with the following formula:
(min(end, endInterval) - max(start, startInterval)) / (endInterval - startInterval)
In this example:
(min(80, 90) - max(20, 10)) / (90 - 10) = (80 - 20) / 80 = 0.75 --> 75%
This document is then shown, since 75% is greater than 50%.
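The arithmetic above can be checked with a few lines of plain Java. This is only a sketch of the formula, independent of MongoDB; the class name OverlapExample and the method overlapFraction are hypothetical names introduced for illustration.

```java
// Hypothetical helper illustrating the overlap formula from the question.
public class OverlapExample {
    // Fraction of [startInterval, endInterval] covered by [start, end].
    // The result is negative when the two ranges do not intersect.
    static double overlapFraction(int start, int end, int startInterval, int endInterval) {
        int covered = Math.min(end, endInterval) - Math.max(start, startInterval);
        return (double) covered / (endInterval - startInterval);
    }

    public static void main(String[] args) {
        double fraction = overlapFraction(20, 80, 10, 90);
        System.out.println(fraction);        // prints 0.75
        System.out.println(fraction > 0.5);  // prints true
    }
}
```

The subtraction has to be evaluated before the division, which is why the numerator is computed separately here.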
I expressed this formula in mongo shell as:
db.getCollection('variants').find(
{
$expr: {
$gt: [
{
$divide: [
{
$subtract: [
{ $min: [ "$end", endInterval ] }
,
{ $max: [ "$start", startInterval ] }
]
}
,
{ $subtract: [ endInterval, startInterval ] }
]
}
,
overlap
]
}
}
)
where
overlap = 0.5, startInterval = 10 and endInterval= 90
It works fine in mongo shell.
I'm asking for an equivalent way to calculate this using Spring Data Criteria, since the $expr functionality I used in mongo shell is still to be implemented in Spring Data Mongo.
Currently I'm using Spring Boot 2.0.0, Spring Data MongoDb 2.0.5 and mongodb 3.6.
Thanks a lot for your time.

There is an open issue to add support for $expr : https://github.com/spring-projects/spring-data-mongodb/issues/2750
In the meantime, you can use a BasicQuery:
BasicQuery query = new BasicQuery("{ $expr: {'$gt': ['$results.cache.lastHit', '$results.cache.expiration']}}");
return ofNullable(mongoTemplate.findAndModify(query, updateDefinition, XXXX.class));
You can even concatenate your existing Criteria with BasicQuery, keeping it exclusive to the $expr:
Criteria criteria = Criteria.where("results.cache.cacheUpdateRetriesLeft").gt(4);
BasicQuery query = new BasicQuery("{ $expr: {'$gt': ['$results.cache.lastHit', '$results.cache.expiration']}}");
query.addCriteria(criteria);
return ofNullable(mongoTemplate.findAndModify(query, updateDefinition, XXXX.class));

Just in case it is helpful for somebody, I finally solved my problem using $redact.
String redact = "{\n" +
" \"$redact\": {\n" +
" \"$cond\": [\n" +
" {\n" +
" \"$gte\": [\n" +
" {\n" +
" \"$divide\": [\n" +
" {\n" +
" \"$subtract\": [\n" +
" {\n" +
" \"$min\": [\n" +
" \"$end\",\n" +
" " + endInterval + "\n" +
" ]\n" +
" },\n" +
" {\n" +
" \"$max\": [\n" +
" \"$start\",\n" +
" " + startInterval + "\n" +
" ]\n" +
" }\n" +
" ]\n" +
" },\n" +
" {\n" +
" \"$subtract\": [\n" +
" " + endInterval + ",\n" +
" " + startInterval + "\n" +
" ]\n" +
" }\n" +
" ]\n" +
" },\n" +
" " + overlap + "\n" +
" ]\n" +
" },\n" +
" \"$$KEEP\",\n" +
" \"$$PRUNE\"\n" +
" ]\n" +
" }\n" +
" }";
RedactAggregationOperation redactOperation = new RedactAggregationOperation(
Document.parse(redact)
);
where RedactAggregationOperation is
public class RedactAggregationOperation implements AggregationOperation {
private Document operation;
public RedactAggregationOperation (Document operation) {
this.operation = operation;
}
@Override
public Document toDocument(AggregationOperationContext context) {
return context.getMappedObject(operation);
}
}

As you mentioned, Spring Data Mongo currently does not support $expr, so I had to use a custom BSON document and call MongoTemplate through reflection.
public List<Variant> listTest() throws Exception {
double overlap = 0.5;
int startInterval = 10;
int endInterval= 90;
String jsonQuery = "{$expr:{$gt:[{$divide:[{$subtract:[{$min:[\"$end\","+endInterval+"]},{$max:[\"$start\","+startInterval+"]}]},{$subtract:["+endInterval+","+startInterval+"]}]},"+overlap+"]}}";
Document query = Document.parse(jsonQuery);
Method doFind = MongoTemplate.class.getDeclaredMethod("doFind", String.class, Document.class,Document.class,Class.class);
doFind.setAccessible(true);
return (List<Variant>) doFind.invoke(mongoTemplate, "variants", query, new Document(), Variant.class);
}
@NoArgsConstructor @Getter @Setter @ToString
public static class Variant{
int start;
int end;
}
As you can see, field mapping works OK.
The Spring Data Mongo artifact used is org.springframework.data:spring-data-mongodb:2.1.5.RELEASE

Support for the $expr operator in the spring-data-mongodb library is still non-existent. However, there is a workaround using MongoTemplate.
Aggregation.match() provides an overload that accepts an AggregationExpression as a parameter. It can be used to build a $match pipeline stage with the $expr operator.
Example usage of AggregationExpression for the $match operator:
Aggregation aggregationQuery = Aggregation.newAggregation(Aggregation.match(AggregationExpression.from(MongoExpression.create("'$expr': { '$gte': [ '$foo', '$bar'] }"))));
mongoTemplate.aggregate(aggregationQuery, Entity.class);
The code above is equivalent to this query:
db.collection.aggregate([{"$match": {"$expr": {"$gte": ["$foo", "$bar"]}}}]);

How can we improve the Update / Write operation on BaseX datastore?

I am benchmarking the performance of BaseX (an XML-based datastore). I tested it with the following test beds:
I) 10,000 users, 10 friends, 10 resources
II) 100,000 users, 10 friends, 10 resources
I faced the issues below:
1) Loading the data is too slow, and it gets slower as the number of threads increases.
2) Plus point: reading/retrieving values from BaseX is faster (17k operations per second).
3) Updating the data in BaseX is very slow. Throughput is ~10 operations per second.
Am I correct to say BaseX is too slow for write/update operations (20/sec) compared to read/retrieve (10k/sec)?
Please advise me on how to make writes and updates more efficient.
I have a function insertEntity (an insert-or-update function) for the BaseX datastore, as follows:
public int insertEntity(String entitySet, String entityPK,
HashMap<String, ByteIterator> values, boolean insertImage) {
String parentTag ="",childTag ="", key="", entryTag="";
StringBuffer insertData = new StringBuffer();
Set<String> keys = values.keySet();
Iterator<String> iterator = keys.iterator();
while(iterator.hasNext()) {
String entryKey = iterator.next();
if(!(entryKey.equalsIgnoreCase("pic") || entryKey.equalsIgnoreCase("tpic")))
insertData.append("element " + entryKey + " {\"" + StringEscapeUtils.escapeXml(values.get(entryKey).toString()) + "\"},");
}
if(entitySet.equalsIgnoreCase("users")&& insertImage){
byte[] profileImage = ((ObjectByteIterator)values.get("pic")).toArray();
String encodedpImage = DatatypeConverter.printBase64Binary(profileImage);
insertData.append(" element pic {\"" + encodedpImage + "\"},");
profileImage = ((ObjectByteIterator)values.get("tpic")).toArray();
encodedpImage = DatatypeConverter.printBase64Binary(profileImage);
insertData.append(" element tpic {\"" + encodedpImage + "\"},");
}
if(entitySet.equalsIgnoreCase("users"))
{
parentTag = "users";
childTag = "members";
entryTag = "member";
key = "mem_id";
insertData.append("element confirmed_friends {}, element pending_friends {}");
}
if(entitySet.equalsIgnoreCase("resources"))
{
parentTag = "resources";
childTag = "resources";
entryTag = "resource";
key = "rid";
insertData.append("element manipulations {}");
}
try {
session.execute(new XQuery(
"insert node element " + entryTag
+ "{ attribute " + key + "{"
+ entityPK + "}, "
+ insertData.toString()
+ "} "
+ "into doc('" + databaseName + "/" + parentTag +".xml')/" + childTag
));
String q1 = "insert node element " + entryTag
+ "{ attribute " + key + "{"
+ entityPK + "}, "
+ insertData.toString()
+ "} "
+ "into doc('" + databaseName + "/" + parentTag +".xml')/" + childTag;
System.out.println(q1);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return 0;
}
The function below is acceptFriend (an update function):
public int acceptFriend(int inviterID, int inviteeID) {
// TODO Auto-generated method stub
String acceptFriendQuery1 = "insert node <confirmed_friend id = '"
+ inviterID + "'>"
+ " </confirmed_friend>"
+ "into doc('"+databaseName+"/users.xml')/members/member[@mem_id = '"+inviteeID+"']/confirmed_friends";
String acceptFriendQuery2 = "insert node <confirmed_friend id = '"
+ inviteeID + "'>"
+ " </confirmed_friend>"
+ "into doc('"+databaseName+"/users.xml')/members/member[@mem_id = '"+inviterID+"']/confirmed_friends";
String acceptFriendQuery3 = "delete node doc('"+databaseName+"/users.xml')/members/member[@mem_id = '"
+ inviteeID + "']/pending_friends/pending_friend[@id = '"+ inviterID +"']";
try {
session.execute(new XQuery(acceptFriendQuery1));
session.execute(new XQuery(acceptFriendQuery2));
session.execute(new XQuery(acceptFriendQuery3));
System.out.println("Inviter: "+inviterID +" AND Invitee: "+inviteeID);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return 0;
}

What's the most appropriate way to compare dates in this hibernate query?

I have a Spring MVC REST service that accepts two @RequestParam parameters called from and to. These are parsed as java.util.Date and passed to the following method in my DAO class.
@Override
public List<ErrorsDTOEntity> getAllErrors(Date from, Date to) {
try {
Query query = getSession().createQuery(
"SELECT NEW com.mydomain.esb.jpa.dto.ErrorsDTOEntity(ee, ec.message) "
+ "FROM ErrorsEntity ee, EventCodeEntity ec "
+ "WHERE ee.responseTime > " + from.getTime() + " "
+ "AND ee.responseTime < " + to.getTime() + " "
+ "AND ee.serviceResponseCode = ec.code "
+ "GROUP BY ee.domainName, ee.serviceName, ec.message, ee.serviceErrorCount, ee.errorTimestamp, "
+ "ee.deviceName, ee.servErrId, ee.serviceResponseCode, ee.elapsedTime, ee.forwardTime, "
+ "ee.responseCompletionTime, ee.responseSizeAverage, ee.requestSizeAverage, ee.responseTime "
+ "ORDER BY ee.domainName, ee.serviceName, ec.message, ee.errorTimestamp");
@SuppressWarnings("unchecked")
List<ErrorsDTOEntity> services = (List<ErrorsDTOEntity>) query.list();
return services;
} catch (HibernateException hex) {
hex.printStackTrace();
}
return null;
}
This is throwing the following SQL error:
org.hibernate.exception.SQLGrammarException: ORA-00932: inconsistent datatypes: expected TIMESTAMP got NUMBER
What's the proper way to structure this query so I can only fetch results between the from and to dates?

I figured it out, this works:
@Override
public List<ErrorsDTOEntity> getAllErrors(Date from, Date to) {
try {
Query query = getSession().createQuery(
"SELECT NEW com.mydomain.esb.jpa.dto.ErrorsDTOEntity(ee, ec.message) "
+ "FROM ErrorsEntity ee, EventCodeEntity ec "
+ "WHERE ee.responseTime > :from "
+ "AND ee.responseTime < :to "
+ "AND ee.serviceResponseCode = ec.code "
+ "GROUP BY ee.domainName, ee.serviceName, ec.message, ee.serviceErrorCount, ee.errorTimestamp, "
+ "ee.deviceName, ee.servErrId, ee.serviceResponseCode, ee.elapsedTime, ee.forwardTime, "
+ "ee.responseCompletionTime, ee.responseSizeAverage, ee.requestSizeAverage, ee.responseTime "
+ "ORDER BY ee.domainName, ee.serviceName, ec.message, ee.errorTimestamp");
query.setTimestamp("from", from);
query.setTimestamp("to", to);
@SuppressWarnings("unchecked")
List<ErrorsDTOEntity> services = (List<ErrorsDTOEntity>) query.list();
return services;
} catch (HibernateException hex) {
hex.printStackTrace();
}
return null;
}

What is the Xpath expression to select all nodes that have text when using the Firefox WebDriver?

I would like to select all nodes that have text in them.
In this example, the outer shouldBeIgnored tag should not be selected:
<shouldBeIgnored>
<span>
the outer Span should be selected
</span>
</shouldBeIgnored>
Some other posts suggest something like this: //*/text().
However, this doesn't work in Firefox.
This is a small unit test to reproduce the problem:
public class XpathTest {
final WebDriver webDriver = new FirefoxDriver();
@Test
public void shouldNotSelectIgnoredTag() {
this.webDriver.get("http://www.s2server.de/stackoverflow/11773593.html");
System.out.println(this.webDriver.getPageSource());
final List<WebElement> elements = this.webDriver.findElements(By.xpath("//*/text()"));
for (final WebElement webElement : elements) {
assertEquals("span", webElement.getTagName());
}
}
@After
public void tearDown() {
this.webDriver.quit();
}
}

If you want to select all nodes that contain text then you can use
//*[text()]
Above xpath will look for any element which contains text. Notice the text() function which is used to determine if current node has text or not.
In your case it will select <span> tag as it contains text.

You can call a javascript function, which shall return you text nodes:
function GetTextNodes(){
var lastNodes = new Array();
$("*").each(function(){
if($(this).children().length == 0)
lastNodes.push($(this));
});
return lastNodes;
}
Selenium WebDriver code:
IJavaScriptExecutor jscript = driver as IJavaScriptExecutor;
List<IWebElement> listTextNodes = jscript.ExecuteScript("return GetTextNodes();");
FYI: Something like this might work for you.

I see no reason why this wouldn't work (in Java):
text = driver.findElement(By.xpath("//span")).getText()
In the odd case that this doesn't work:
text = driver.findElement(By.xpath("//span")).getAttribute("innerHTML")

In the end, I found out that there is no way to do it with XPath alone (because XPath's text() also selects the inner text of a node). As a workaround, I have to inject JavaScript that returns all elements selected by an XPath that have some text.
Like this:
public class XpathTest
{
//@formatter:off
final static String JS_SCRIPT_GET_TEXT = "function trim(str) { " +
" return str.replace(/^\\s+|\\s+$/g,''); " +
"} " +
" " +
"function extractText(element) { " +
" var text = ''; " +
" for ( var i = 0; i < element.childNodes.length; i++) { " +
" if (element.childNodes[i].nodeType === Node.TEXT_NODE) { " +
" nodeText = trim(element.childNodes[i].textContent); " +
" " +
" if (nodeText) { " +
" text += element.childNodes[i].textContent + ' '; " +
" } " +
" } " +
" } " +
" " +
" return trim(text); " +
"} " +
" " +
"function selectElementsHavingTextByXPath(expression) { " +
" " +
" result = document.evaluate(\".\" + expression, document.body, null, " +
" XPathResult.ANY_TYPE, null); " +
" " +
" var nodesWithText = new Array(); " +
" " +
" var node = result.iterateNext(); " +
" while (node) { " +
" if (extractText(node)) { " +
" nodesWithText.push(node) " +
" } " +
" " +
" node = result.iterateNext(); " +
" } " +
" " +
" return nodesWithText; " +
"} " +
"return selectElementsHavingTextByXPath(arguments[0]);";
//@formatter:on
final WebDriver webDriver = new FirefoxDriver();
@Test
public void shouldNotSelectIgnoredTag()
{
this.webDriver.get("http://www.s2server.de/stackoverflow/11773593.html");
final List<WebElement> elements = (List<WebElement>) ((JavascriptExecutor) this.webDriver).executeScript(JS_SCRIPT_GET_TEXT, "//*");
assertFalse(elements.isEmpty());
for (final WebElement webElement : elements)
{
assertEquals("span", webElement.getTagName());
}
}
@After
public void tearDown()
{
this.webDriver.quit();
}
}
I modified the unit test so that the example is testable.

One problem with locating text nodes is that even whitespace-only strings are considered valid text nodes. For example,
<tag1><tag2/></tag1>
has no text nodes, but
<tag1> <tag2/> </tag1>
has 2 text nodes, one with 2 spaces and another with 4 spaces.
If you want only the text nodes that have non-empty text, here is one way to do it:
//text()[string-length(normalize-space(.))>0]
or to get their parent elements
//*[text()[string-length(normalize-space(.))>0]]
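To see how the second expression behaves on the markup from the question, here is a minimal sketch that evaluates it with the JDK's built-in XPath engine (javax.xml.xpath) instead of the Firefox WebDriver. That substitution is an assumption made to keep the example self-contained, and the class name NonEmptyTextXPath and the method elementsWithText are hypothetical. WebDriver's findElements expects elements rather than text nodes, so the parent-element form is probably the one to try there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo: lists the elements that have a non-blank text node child.
public class NonEmptyTextXPath {
    static final String EXPRESSION = "//*[text()[string-length(normalize-space(.))>0]]";

    // Returns the tag names of all matching elements, in document order.
    static List<String> elementsWithText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(EXPRESSION, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            // Parsing or XPath failure: rethrow unchecked to keep the demo short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<shouldBeIgnored>\n<span>\nthe outer Span should be selected\n</span>\n</shouldBeIgnored>";
        // shouldBeIgnored only has whitespace-only text nodes, so it is skipped.
        System.out.println(elementsWithText(xml)); // prints [span]
    }
}
```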
