Get excerpt around search word

I'm using Sphinx to index and search my website.
Is there a way to return an excerpt (the searched word and a few words around it) when searching for a word?
Right now I'm using this package, scalia.
$results = SphinxSearch::search($search, $index_type);
// Set match, sort and ranking mode
$results
    ->setMatchMode($search_mode)
    ->setSortMode($sort_mode, $sort_column)
    ->setRankingMode(\Sphinx\SphinxClient::SPH_RANK_SPH04);
// Set field weights
$results->setFieldWeights(array('title' => 10, 'content' => 5));
// And get results
$results = $results->limit(300, 0, 1000, 100000)->query();
Is there a way to make Sphinx return an excerpt of the text where it found the searched keywords?

You need the BuildExcerpts function from the Sphinx API:
http://sphinxsearch.com/docs/current.html#api-func-buildexcerpts
Alas, that extension does not seem to expose the function, so you would have to either modify the extension or bypass it.
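If you end up bypassing the extension, the general idea of an excerpt can be sketched in plain application code. The following is only an illustration in Python, not Sphinx's BuildExcerpts (which runs on the search daemon and applies the index's own text processing). The function name build_excerpt and its parameters are made up for this example.

```python
def build_excerpt(text, keyword, around=5, marker="..."):
    """Return the keyword plus up to `around` words on each side, or None if absent."""
    words = text.split()
    needle = keyword.lower()
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring punctuation attached to the word
        if word.strip(".,;:!?\"'()").lower() == needle:
            start = max(0, i - around)
            end = min(len(words), i + around + 1)
            excerpt = " ".join(words[start:end])
            # Mark the sides where text was cut off
            prefix = marker + " " if start > 0 else ""
            suffix = " " + marker if end < len(words) else ""
            return prefix + excerpt + suffix
    return None


sample = "Sphinx is a full-text search engine that indexes documents quickly"
print(build_excerpt(sample, "search", around=2))
```

This only matches whole words exactly, so it will miss stemmed or wildcard matches that the real search would find.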

Related

How to make Neptune Search more lenient

I have some entries inside the graph that I am searching (e.g. hello_world, foo_bar_baz) and I want to be able to search "hello" and get hello_world back.
Currently, I will only get a result if I search the entire string (i.e. searching hello_world or foo_bar_baz)
This seems to be due to elasticsearch's standard analyzer behaviour but I don't know how to deal with this with Neptune.
with neptune_graph() as g:
    my_query = " OR ".join(
        f"predicates.{field}.value:({query})" for field in ['names', 'spaces']
    )
    search_results = (
        g.withSideEffect(
            "Neptune#fts.endpoint", f"https://{neptuneSearchURL}"
        )
        .withSideEffect("Neptune#fts.queryType", "query_string")
        .withSideEffect("Neptune#fts.sortOrder", "DESC")
        .V()
        .hasLabel("doc")
        .has(
            "*",
            f"Neptune#fts entity_type:table AND ({my_query})",
        )
    )

One way is to use a wild card.
Given:
g.addV('search-test').property('name','Hello_World')
v[0ebedfda-a9bd-e320-041a-6e98da9b1379]
Assuming the search integration is all in place, after the search index has been updated, the following will find the vertex:
g.withSideEffect("Neptune#fts.endpoint",
"https://vpc-neptune-xxx-abc123.us-east-1.es.amazonaws.com").
withSideEffect('Neptune#fts.queryType', 'query_string').
V().
has('name','Neptune#fts hello*').
elementMap().
unfold()
Which yields
{<T.id: 1>: '0ebedfda-a9bd-e320-041a-6e98da9b1379'}
{<T.label: 4>: 'search-test'}
{'name': 'Hello_World'}

The problem I was having was indeed the analyzer, except I didn't understand how to fix it until now.
When creating the Elasticsearch index in the first place, you need to specify the settings you want.
The solution was to create the index like this:
with neptune_search() as es:
    # Create the index with a custom default analyzer before indexing any documents
    es.indices.create(
        index="my_index",
        body={
            "settings": {"analysis": {"analyzer": {"default": {
                "type": "custom",
                "tokenizer": "lowercase",
            }}}}
        },
    )
    # Then index documents as before:
    # es.index(index="my_index", ... other stuff)
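To see why the tokenizer matters, here is a rough Python approximation of the two behaviours. It is not Elasticsearch code: the regular expressions only imitate how the standard tokenizer keeps underscore-joined words together while the lowercase tokenizer splits on every non-letter, and the function names are made up for this sketch.

```python
import re


def standard_like_tokens(text):
    # Approximation: letters, digits and underscores stay in one token
    return re.findall(r"\w+", text.lower())


def lowercase_like_tokens(text):
    # Approximation: split on anything that is not a letter, then lowercase
    return re.findall(r"[^\W\d_]+", text.lower())


print(standard_like_tokens("Hello_World foo_bar_baz"))
print(lowercase_like_tokens("Hello_World foo_bar_baz"))
```

With the second behaviour, "hello" is indexed as its own term, so searching for it can match hello_world without a wildcard.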

Sitecore item multilistfield XPATH builder

I'm trying to count with XPATH Builder in Sitecore, the number of items which have more than 5 values in a multilist field.
I cannot count the number of "|" from raw values, so I can say I am stuck.
Any info will be helpful.
Thank you.

It's been a long time since I used XPath in Sitecore - so I may have forgotten something important - but:
Sadly, I don't think this is possible. XPath Builder doesn't really run proper XPath. It understands a subset of things that would evaluate correctly in a full XPath parser.
One of the things it can't do (on the v8-initial-release instance I have to hand) is process XPath that returns things that are not Sitecore Items. A query like count(/sitecore/content/*) should return a number, but if you try to run it using either the Sitecore Query syntax or the XPath syntax option, you get an error.
If you could run such a query, then your answer would be based on an expression like this, to perform the count of GUIDs referenced by a specific field:
string-length( translate(/yourNodePath/@yourFieldName, "abcdefg1234567890{}-", "") ) + 1
(Typed from memory, as I can't run a test - so may not be entirely correct)
The translate() function replaces each character of its first argument that appears in the second argument with the character at the same position in the third. Because the third argument is empty here, those characters are simply removed. Hence (if I've typed it correctly) that expression should strip out all your GUIDs and leave just the pipe-separator characters, so one plus the length of the remaining string is your answer for each Item you need to process.
But, as I say, I don't think you can actually run that from Query Builder...
These days, people tend to use Sitecore PowerShell Extensions to write ad-hoc queries like this. It's much more flexible and powerful - so if you can use that, I'd recommend it.
Edited to add: This question got a bit stuck in my head - so if you are able to use PowerShell, here's how you might do it:
Assuming you have declared where you're searching, what MultiList field you're querying, and what number of selections Items must exceed:
$root = "/sitecore/content/Root"
$field = "MultiListField"
$targetNumber = 3
then the "easy to read" code might look like this:
foreach($item in Get-ChildItem $root)
{
    $currentField = Get-ItemField $item -ReturnType Field -Name $field
    if($currentField)
    {
        $count = $currentField.Value.Split('|').Count
        if($count -gt $targetNumber)
        {
            $item.Paths.Path
        }
    }
}
It iterates over the children of the root item you specified and gets the contents of your field. If the item has that field, it splits the value into GUIDs and counts them. If the count is greater than your threshold, it outputs the item's path.
You can get the same answer out of a (harder to read) one-liner, which would look something like:
Get-ChildItem $root | Select-Object Paths, @{ Name="FieldCount"; Expression={ Get-ItemField $_ -ReturnType Field -Name $field | % { $_.Value.Split('|').Count } } } | Where-Object { $_.FieldCount -gt $targetNumber } | % { $_.Paths.Path }
(Not sure if that's the best way to write that - I'm no expert at PowerShell syntax - but it gives the same results as far as I can see)
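The two counting ideas above (stripping the GUID characters and counting what is left, versus splitting on the pipe) can be compared in a small Python sketch. This runs outside Sitecore and exists only to check the arithmetic. The raw value below is invented sample data, and the character set assumes GUIDs may appear in either letter case.

```python
# Characters that can occur inside a braced GUID, in either letter case
GUID_CHARS = "0123456789abcdefABCDEF{}-"


def count_by_stripping(raw_value):
    """Mimic the translate() trick: delete GUID characters, then count pipes + 1."""
    if not raw_value:
        return 0
    separators = raw_value.translate(str.maketrans("", "", GUID_CHARS))
    return len(separators) + 1


def count_by_splitting(raw_value):
    """Mimic the PowerShell approach: split on the pipe and count the parts."""
    if not raw_value:
        return 0
    return len(raw_value.split("|"))


# Invented multilist raw value holding three GUIDs
raw = ("{11111111-AAAA-4BBB-8CCC-000000000001}|"
       "{22222222-AAAA-4BBB-8CCC-000000000002}|"
       "{33333333-AAAA-4BBB-8CCC-000000000003}")
print(count_by_stripping(raw), count_by_splitting(raw))
```

Both functions treat an empty field as zero selections, which is a choice made for this sketch. If the character set misses any character that appears in the GUIDs, the stripping approach over-counts, which is one reason the split-based version is easier to trust.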

Selenium Webdriver + Ruby regex: Can I use regex with find_element?

I am trying to click an element that changes per each order like so
edit_div_123
edit_div_124
edit_div_xxx
xxx = any three numbers
I have tried using regex like so:
@driver.find_element(:css, "#edit_order_#{\d*} > div.submit > button[name=\"commit\"]").click
@driver.find_element(:xpath, "//*[(@id = "edit_order_#{\d*}")]//button").click
Is this possible? Any other ways of doing this?

You cannot use Regexp, like the other answers have indicated.
Instead, you can use a nifty CSS Selector trick:
@driver.find_element(:css, "[id^=\"edit_order_\"] > div.submit > button[name=\"commit\"]").click
Using:
^= matches an element whose attribute value begins with your criterion.
*= matches when the criterion appears anywhere within the attribute value.
$= matches an element whose attribute value ends with your criterion.
~= matches on a single criterion when the attribute value is a space-separated list of values.
Take a look at http://net.tutsplus.com/tutorials/html-css-techniques/the-30-css-selectors-you-must-memorize/ for some more info on other neat CSS tricks you should add to your utility belt!
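The four operators listed above behave like simple string tests. The sketch below imitates them in plain Python so you can try out the matching rules without a browser. It is not Selenium code, and attr_matches is a made-up helper name.

```python
def attr_matches(value, operator, criterion):
    """Imitate CSS attribute-selector matching on a single attribute value."""
    if not criterion:
        # In CSS, an empty criterion matches nothing for these operators
        return False
    if operator == "^=":
        return value.startswith(criterion)
    if operator == "*=":
        return criterion in value
    if operator == "$=":
        return value.endswith(criterion)
    if operator == "~=":
        # Whole-word match against a whitespace-separated list
        return criterion in value.split()
    raise ValueError("unsupported operator: " + operator)


print(attr_matches("edit_order_123", "^=", "edit_order_"))
print(attr_matches("btn primary large", "~=", "prim"))
```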

You have not provided any HTML fragment that you are working on, so my answer is based only on the limited input provided in your question.
I don't think the WebDriver APIs support regex for locating elements. However, you can achieve what you want using plain XPath as follows:
//*[starts-with(@id, 'edit_div_')]//button
Explanation: the XPath above searches for all <button> nodes under any element whose id attribute starts with the string edit_div_
In short, you can use the starts-with() XPath function to match an element whose id has the format edit_div_ followed by any number of characters.

No, you cannot.
But you should do something like this:
function hasClass(element, className) {
    // Match className as a whole, whitespace-delimited token
    var re = new RegExp('(?:^|\\s+)' + className + '(?:\\s+|$)');
    return re.test(element.className);
}

This worked for me
@driver.find_element(:xpath, "//a[contains(@href, 'person')]").click

Read image IPTC data

I'm having some trouble reading the IPTC data of some images. I want to do this because my client already has all the keywords in the IPTC data and doesn't want to re-enter them on the site.
So I created this simple script to read them out:
$size = getimagesize($image, $info);
if(isset($info['APP13'])) {
    $iptc = iptcparse($info['APP13']);
    print '<pre>';
    var_dump($iptc['2#025']);
    print '</pre>';
}
This works perfectly in most cases, but it's having trouble with some images.
Notice: Undefined index: 2#025
Yet I can clearly see the keywords in Photoshop.
Are there any decent small libraries that could read the keywords in every image? Or am I doing something wrong here?

I've seen a lot of weird IPTC problems. It could be that you have two APP13 segments. I've noticed that, for some reason, some JPEGs have multiple IPTC blocks, possibly as a result of using several photo-editing programs or of some manual file manipulation.
It could be that PHP is trying to read an empty APP13 segment, or even embedded "thumbnail metadata".
It could also be a problem with segment lengths: APP13 and 8BIM have length marker bytes that might hold wrong values.
Try a hex editor and check the file "manually".

I have found that IPTC is almost always embedded as XML using the XMP format, and is often not in the APP13 slot. You can sometimes get the IPTC info by using iptcparse($info['APP1']), but the most reliable way to get it without a third-party library is to simply search through the image file for the relevant XML string (I got this from another answer, but I haven't been able to find it, otherwise I would link!):
The xml for the keywords always has the form "<dc:subject>...<rdf:Seq><rdf:li>Keyword 1</rdf:li><rdf:li>Keyword 2</rdf:li>...<rdf:li>Keyword N</rdf:li></rdf:Seq>...</dc:subject>"
So you can just get the file as a string using file_get_contents(get_attached_file($attachment_id)), use strpos() to find each opening (<rdf:li>) and closing (</rdf:li>) XML tag, and grab the keyword between them using substr().
The following snippet works for all jpegs I have tested it on. It will fill the array $keys with IPTC tags taken from an image on wordpress with id $attachment_id:
$content = file_get_contents(get_attached_file($attachment_id));
// Initialize empty array to store keywords
$keys = array();
// Look for XMP data: the XML tag "dc:subject" is where keywords are stored
$xmp_data_start = strpos($content, '<dc:subject>');
// Only proceed if able to find dc:subject tag (strict check, since strpos() can return 0)
if ($xmp_data_start !== false) {
    // Skip past the opening tag itself
    $xmp_data_start += 12;
    $xmp_data_end = strpos($content, '</dc:subject>');
    $xmp_data_length = $xmp_data_end - $xmp_data_start;
    $xmp_data = substr($content, $xmp_data_start, $xmp_data_length);
    // Look for tag "rdf:Seq" where individual keywords are listed
    $key_data_start = strpos($xmp_data, '<rdf:Seq>');
    // Only proceed if able to find rdf:Seq tag
    if ($key_data_start !== false) {
        $key_data_start += 9;
        $key_data_end = strpos($xmp_data, '</rdf:Seq>');
        $key_data_length = $key_data_end - $key_data_start;
        $key_data = substr($xmp_data, $key_data_start, $key_data_length);
        // $ctr will track position of each <rdf:li> tag, starting with first
        $ctr = strpos($key_data, '<rdf:li>');
        // While loop stores each keyword and searches for next XML keyword tag
        // (strict check: the first <rdf:li> is often at position 0)
        while ($ctr !== false && $ctr < $key_data_length) {
            // Skip past the tag to get the keyword itself
            $key_begin = $ctr + 8;
            // Keyword ends where closing tag begins
            $key_end = strpos($key_data, '</rdf:li>', $key_begin);
            // Make sure keyword has a closing tag
            if ($key_end === false) break;
            // Make sure keyword is not too long (not sure what WP can handle)
            $key_length = $key_end - $key_begin;
            $key_length = (100 < $key_length ? 100 : $key_length);
            // Add keyword to keyword array
            array_push($keys, substr($key_data, $key_begin, $key_length));
            // Find next keyword open tag
            $ctr = strpos($key_data, '<rdf:li>', $key_end);
        }
    }
}
I have this implemented in a plugin to put IPTC keywords into WP's "Description" field, which you can find here.
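The same search-the-file idea can be written more compactly with regular expressions. The Python sketch below is an illustration under stated assumptions, not a port of the plugin. It assumes the XMP packet is stored as plain UTF-8 text inside the file and that keywords are rdf:li items inside dc:subject. Because it only looks for rdf:li, it does not care whether the wrapping container is rdf:Seq, as in the form above, or another RDF container such as rdf:Bag. The function name and sample bytes are invented.

```python
import re


def xmp_keywords(data):
    """Extract keyword strings from the dc:subject block of raw image bytes."""
    subject = re.search(rb"<dc:subject>(.*?)</dc:subject>", data, re.DOTALL)
    if subject is None:
        return []
    # Each keyword is an rdf:li item, whichever RDF container wraps the list
    items = re.findall(rb"<rdf:li[^>]*>(.*?)</rdf:li>", subject.group(1), re.DOTALL)
    return [item.decode("utf-8", errors="replace").strip() for item in items]


# Invented bytes standing in for a JPEG with an embedded XMP packet
fake_jpeg = (b"\xff\xd8binary-junk<dc:subject><rdf:Seq>"
             b"<rdf:li>sunset</rdf:li><rdf:li>beach</rdf:li>"
             b"</rdf:Seq></dc:subject>more-binary\xff\xd9")
print(xmp_keywords(fake_jpeg))
```

For a real file you would pass the result of open(path, "rb").read(). A proper XMP parser, or ExifTool, is more robust than either string search.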

ExifTool is very robust if you can shell out to that (from PHP it looks like?)

Pulling Images from rss/atom feeds using magpie rss

I'm using PHP and Magpie and would like a general way of detecting images in a feed item. I know some websites place images within the enclosure tag, others like this images[rss], and some simply add them to the description. Does anyone have a general function for detecting whether an RSS item has an image and extracting the image URL after it has been parsed by Magpie?
I think regular expressions would be needed to extract from the description, but I'm a noob at those. Please help if you can.

I spent ages searching for a way of displaying images in RSS via Magpie myself, and in the end I had to examine the code to figure out how to get it to work.
Like you say, the reason Magpie doesn't pick up images in the element is because they are specified using the 'enclosure' tag, which is an empty tag where the information is in the attributes, e.g.
<enclosure url="http://www.mysite.com/myphoto.jpg" length="14478" type="image/jpeg" />
As a hack to get it to work quickly for me I added the following lines of code into rss_parse.inc:
function feed_start_element($p, $element, &$attrs) {
    ...
    if ( $el == 'channel' )
    {
        $this->inchannel = true;
    }
    ...
    // START EDIT - add this elseif condition to the if ($el=xxx) statement.
    // Checks if element is enclosure tag, and if so store the attribute values
    elseif ($el == 'enclosure' ) {
        if ( isset($attrs['url']) ) {
            $this->current_item['enclosure_url'] = $attrs['url'];
            $this->current_item['enclosure_type'] = $attrs['type'];
            $this->current_item['enclosure_length'] = $attrs['length'];
        }
    }
    // END EDIT
    ...
}
The url to the image is in $myRSSitem['enclosure_url'] and the size is in $myRSSitem['enclosure_length'].
Note that enclosure tags can refer to many types of media, so first check if the type is actually an image by checking $myRSSitem['enclosure_type'].
Maybe someone else has a better suggestion, and I'm sure this could be done more elegantly to pick up attributes from other empty tags. I needed a very quick fix (deadline pressures), but I hope this helps someone else in difficulty!
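For the more general question (enclosure first, then images embedded in the description), the detection logic might look like the Python sketch below. It is language-neutral logic, not Magpie code. It assumes a parsed item behaves like a dictionary with the enclosure_url and enclosure_type keys from the edit above plus a description key. The function name is hypothetical, and the regular expression is a pragmatic pattern for typical feed markup, not a full HTML parser.

```python
import re

# First src attribute of an <img> tag, with single or double quotes
IMG_SRC = re.compile(r"""<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def find_image_url(item):
    """Return the first image URL found in a parsed feed item, or None."""
    # 1. Prefer an enclosure whose MIME type says it is an image
    url = item.get("enclosure_url")
    if url and (item.get("enclosure_type") or "").startswith("image/"):
        return url
    # 2. Fall back to the first <img src="..."> in the description HTML
    match = IMG_SRC.search(item.get("description") or "")
    return match.group(1) if match else None


print(find_image_url({"description": '<p><img src="http://example.com/b.png"></p>'}))
```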
