Solr Proximity search - sorting

I need to sort search results as described at the link below:
https://pastebin.com/abuwJxQp
Example 1:
Word1 Word2 Word3
Doc1 Word1
Doc2 Word1 Word3
Doc3 Word1 Word2 Word3
Doc4 Word2 Word3
Doc5 Word2 Word1 Word3
------------>
Doc3
Doc5
Doc4
Doc2
Doc1
Example 2:
Word1 Word2 Word3
Doc1 (Title) Word1 Word2 Word3
Doc2 (Abstract) Word1 Word3 Word3
Doc3 (AuthorName) Word1 Word2 Word3
Doc4 (JournalName) Word1 Word2 Word3
Doc5 (PublisherName) Word1 Word2 Word3
------------>
Doc1
Doc2
?
Example 3:
Word1 Word2 Word3
Doc1 Word1
Doc2 Word1 Word3
Doc3 Word1 Word2 Word3
Doc4 Word2 Word3
Doc5 Word2 Word1 Word3
Doc6 Word1 Word1 Word1 Word1 Word1 Word1 Word1 Word1 Word1 Word1 Word1 Word1
Doc7 Word1 Word2 Word1 Word2 Word1 Word2 Word1 Word2 Word1 Word2 Word1 Word2 Word1 Word2 Word1 Word2
------------>
Doc3
Doc5
Doc4
Doc2
Doc6?
Doc1

You might want to look into "distance" measures: see the answers to the question "Edit Distance Similarity in Lucene/Solr".
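As a rough illustration of the ordering asked for in Example 1 (this is plain Python, not Solr's scoring, and not the edit-distance approach from the linked question), documents can be ranked first by how many distinct query words they contain and then by how many adjacent query-word pairs they keep in the same order. The function name `rank` and the two-part score are assumptions made up for this sketch.

```python
def rank(query, docs):
    """Order doc ids by (distinct query words matched, adjacent query pairs kept in order)."""
    terms = query.split()
    pairs = set(zip(terms, terms[1:]))  # adjacent query words, e.g. (Word1, Word2)

    def score(text):
        words = text.split()
        matched = len(set(terms) & set(words))
        in_order = len(pairs & set(zip(words, words[1:])))
        return matched, in_order

    # Highest score first; sorted() is stable, so ties keep their input order.
    return sorted(docs, key=lambda doc_id: score(docs[doc_id]), reverse=True)


docs = {
    "Doc1": "Word1",
    "Doc2": "Word1 Word3",
    "Doc3": "Word1 Word2 Word3",
    "Doc4": "Word2 Word3",
    "Doc5": "Word2 Word1 Word3",
}
print(rank("Word1 Word2 Word3", docs))
# ['Doc3', 'Doc5', 'Doc4', 'Doc2', 'Doc1']
```

The sketch ignores term frequency and field names, so Doc6 from Example 3 would tie with Doc1, and Example 2 would need a per-field weight on top of this.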

Related

Windows cmd syntax: remove first word in text file on every line

Hi, I want the following behavior in a batch script on Windows 2012 or later.
Scenario:
Example input.txt:
Word1 Word2 Word3.......Wordn
Word1 Word2 Word3.......Wordn
Word1 Word2 Word3.......Wordn
Word1 Word2 Word3.......Wordn
Example output.txt:
Word2 Word3 ........Wordn
Word2 Word3 ........Wordn
Word2 Word3 ........Wordn
Word2 Word3 ........Wordn
Used syntax:
@echo off
(for /f "tokens=1,* usebackq" %%a in ("input.txt") do @echo %%b)>"output.txt"
type output.txt gives just rubbish:
%b
%b
%b
%b
I tested without @ but it made no difference. I checked many examples, and apparently this should have worked in older Windows versions.
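If a batch-only solution is not a hard requirement, the transformation itself (drop the first whitespace-delimited word of every line) is small enough to sketch in Python. This is an alternative under that assumption, not a fix for the `for /f` line above; the function names are made up for the example.

```python
def drop_first_word(line):
    """Return the line without its first whitespace-delimited word."""
    parts = line.split(None, 1)  # split once, at the first run of whitespace
    return parts[1] if len(parts) > 1 else ""


def convert(src="input.txt", dst="output.txt"):
    """Copy src to dst, dropping the first word of every line."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(drop_first_word(line.rstrip("\n")) + "\n")
```

Calling `convert("input.txt", "output.txt")` writes the shortened lines to output.txt. Unlike `for /f`, this keeps empty lines instead of skipping them.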

bash text parsing with multiple conditions nested

I have the following code that checks for lines over 10 words and splits them where the first comma character appears. It reiterates the process so all newly split lines with over 10 words and commas are also split (in the end there are no lines with over 10 words and commas).
How do I edit this code to do the following: after all the comma splitting is done (which the current code already does), check whether the resulting lines have over 10 words and split them where the first "and " (with a space) appears?
#!/usr/bin/env bash
input=input.txt
temp=$(mktemp ${input}.XXXX)
trap "rm -f $temp" 0
while awk '
BEGIN { retval=1 }
NF >= 10 && /, / {
sub(/, /, ","ORS)
retval=0
}
1
END { exit retval }
' "$input" > "$temp"; do
mv -v $temp $input
done
Input sample:
Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9
Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10 Word11
Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10, Word11 Word12 Word13 Word14 Word15 Word16
Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10 Word11 and Word12 Word13 Word14 Word15
Word1 Word2 Word3 Word4 and Word5
Desired output:
Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10 Word11
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10,
Word11 Word12 Word13 Word14 Word15 Word16
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10 Word11 and
Word12 Word13 Word14 Word15
Word1 Word2 Word3 Word4 and Word5
Thank you in advance!
Please try the following:
awk '{
while (split($0, a, "( +and +)|( +)") > 10 && match($0, "( +and +)|,")) {
if (match($0, "[^,]+,")) {
# puts a newline after the 1st comma
print substr($0, 1, RLENGTH)
$0 = substr($0, RLENGTH + 1)
} else {
# puts a newline before the 1st substring " and "
n = split($0, a, " +and +")
if (a[1] == "") { # $0 starts with " and "
a[1] = " and " a[2]
for (i = 2; i < n; i++) {
a[i] = a[i+1]
}
n--
}
print a[1]
$0 = " and " a[2]
for (i = 3; i <= n; i++) { # there are two or more " and "
$0 = $0 " and " a[i]
}
}
}
print
}' input.txt
Output for the given input:
Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10 Word11
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10,
Word11 Word12 Word13 Word14 Word15 Word16
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10 Word11
and Word12 Word13 Word14 Word15
Word1 Word2 Word3 Word4 and Word5
[Explanations]
It iterates on the same record while the pattern space contains
more than 10 fields (excluding the word "and") and still includes
a separator (a comma or " and "), which enables successive splitting.
If the pattern space contains a comma, it prints the left-hand part
and updates the pattern space with the right-hand part.
If the pattern space contains the word " and ", the processing is a bit
more difficult because the word remains in the updated pattern space.
My approach may not be elegant, but it works even if a record
contains multiple (two or more) " and "s.
[EDIT]
If you want to include the word and as a part of the word count, please replace the 2nd line:
while (split($0, a, "( +and +)|( +)") > 10 && match($0, "( +and +)|,")) {
with:
while (NF > 10 && match($0, "( +and +)|,")) {
In addition, if you allow the word and to stay at the end of the
original line, the script can be simplified a bit:
awk '{
while (NF > 10 && match($0, "( +and +)|,")) {
if (match($0, "[^,]+,")) {
# puts a newline after the 1st comma
print substr($0, 1, RLENGTH)
$0 = substr($0, RLENGTH + 1)
} else {
# puts a newline after the 1st substring " and "
n = split($0, a, " +and +")
print a[1] " and"
$0 = " " a[2]
for (i = 3; i <= n; i++) { # there are two or more " and "
$0 = $0 " and " a[i]
}
}
}
print
}' input.txt
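For readers who find the awk loop hard to follow, the same two-pass idea (split at commas first, then at " and ") can be sketched in Python. This is only an illustration, not a translation of the script above: the function names and the `min_words` parameter are made up here, and the `>= 10` threshold is taken from the `NF >= 10` test in the question's code.

```python
def split_after(line, sep, min_words):
    """Break line after the first sep, repeating while the remainder is still long."""
    pieces = []
    while len(line.split()) >= min_words and sep in line:
        head, line = line.split(sep, 1)
        pieces.append(head + sep.rstrip())  # keep "," or " and" on the left part
    pieces.append(line)
    return pieces


def split_long_lines(lines, min_words=10):
    """Pass 1 splits long lines at ', '; pass 2 splits what is still long at ' and '."""
    result = []
    for line in lines:
        for piece in split_after(line, ", ", min_words):
            result.extend(split_after(piece, " and ", min_words))
    return result


sample = ("Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 "
          "Word10 Word11 and Word12 Word13 Word14 Word15")
print("\n".join(split_long_lines([sample])))
```

For the sample line this prints three lines: the part up to the comma, the part ending in "and", and the remainder, which is the layout shown in the question's desired output.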
Moreover, if Perl is your option, you can say:
perl -ne '{
while (split > 10 && /( +and +)|,/) {
if (/^.*?(, *| +and +)/) {
print $&, "\n";
$_ = " $'\''";
}
}
print
}' input.txt
Hope this helps.
Is this your expected answer?
echo "Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10, Word11 Word12 Word13 Word14 Word15 Word16 Word17 Word18 Word19 Word20 Word21 and Word22 Word23 Word24." | grep -oE '[a-zA-Z0-9,.]+' | awk '
BEGIN {
cnt = 0
}
{
str = str " " $0
if ($0 ~ /,$/){
print str
cnt = 0
str = ""
}
else if (cnt < 10){
cnt++
}
else {
print str
cnt = 0
str = ""
}
} END {
print str
}' | sed 's/^ *//'
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10,
Word11 Word12 Word13 Word14 Word15 Word16 Word17 Word18 Word19 Word20 Word21
and Word22 Word23 Word24.

How to call a variable declared in another perl file?

I have script1.pl and script2.pl. I want script2.pl to be able to read the value of $string from script1.pl.
script1.pl
$string="word1 word2 word3 word4 word5 word6 word7 word8 word9";
$cmd="perl \"My\\File\\Path\\script2.pl\"";
system ($cmd);
script2.pl
print $string;
Note: I'm using perl for Windows.
Best practice is to use a module. See perlmod.
In your case, you can use require. Make sure that the required file returns a true value by ending it with 1;.
script1.pl:
#!/usr/bin/perl
use warnings;
use strict;
our $string = "word1 word2 word3 word4 word5 word6 word7 word8 word9";
our $cmd = "perl \"My\\File\\Path\\script2.pl\"";
system ($cmd);
1;
script2.pl:
#!/usr/bin/perl
use strict;
use warnings;
use vars qw($string);
require "script1.pl";
print $string, "\n";
Output:
word1 word2 word3 word4 word5 word6 word7 word8 word9
While you can make that work, you're much better off passing the data in as command line arguments or, if there's a lot of data, via STDIN.
# script1.pl
my $cmd = qq[$^X "My\\File\\Path\\script2.pl"];
my @words = qw[word1 word2 word3 word4 word5 word6 word7 word8 word9];
system $cmd, @words;
# script2.pl
print join ", ", @ARGV;
This doesn't scale well. You're better off rewriting script2.pl as a library and calling a function.
# mylibrary.pl
sub print_stuff {
print join ", ", @_;
}
# script1.pl
require 'mylibrary.pl';
print_stuff(qw[word1 word2 word3 word4 word5 word6 word7 word8 word9]);
For a handful of functions this will work fine. Eventually you'll want to look into writing modules.

Replacing a string in a Table with another string

I'm trying to solve a problem using the sed command.
I have a table with data (a few rows and columns).
I want to be able to replace the string in the i,j spot with a new string.
For example:
word1 word2 word3 word4
word5 word6 word7 word8
word9 word10 word11 word12
with the input 1,1 and abc, it should return
word1 word2 word3 word4
word5 abc word7 word8
word9 word10 word11 word12
And if possible, print it to a new file.
Thanks
Using awk might be easier:
awk -v c=1 -v r=1 -v w='abc' 'NR==r+1{$(c+1)=w}1' file
word1 word2 word3 word4
word5 abc word7 word8
word9 word10 word11 word12
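The same zero-based row/column replacement can also be sketched in Python, in case awk is not available. The function name `replace_cell` is made up for this example, and the table is assumed to be separated by whitespace, as in the question.

```python
def replace_cell(text, row, col, new_value):
    """Replace the word at zero-based (row, col) in a whitespace-separated table."""
    lines = text.splitlines()
    fields = lines[row].split()
    fields[col] = new_value
    lines[row] = " ".join(fields)
    return "\n".join(lines)


table = """word1 word2 word3 word4
word5 word6 word7 word8
word9 word10 word11 word12"""

print(replace_cell(table, 1, 1, "abc"))
```

To get the result into a new file, write the returned string out with `open("newfile", "w")`; with the awk one-liner, appending `> newfile` to the command does the same.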

Looking for Search Algorithm Name

I assume there is a name for what I describe here.
Basically, if I search for "word1 word2 word3" (without quotes) and I have this array:
["word1 word2",
"word1 word2 word3",
"word3 word2 word1",
"word2 word3 word1",
"word1 word3 word2 word4",
"word1 word4 word3",
"word4 word1 word2 word3"]
It should return these found results:
word1 word2 word3
word3 word2 word1
word2 word3 word1
word1 word3 word2 word4
word4 word1 word2 word3
Is there any name for such an algorithm?
The description would be:
"Search for all strings that contain all permutations of the following words". So maybe it should be called "Permutation Search": http://www.keyworddiscovery.com/feature-permutation-search.html
If you also allow word1 word4 word2 word3 to be returned it would be called 'keyword based search' or 'full text search' with the limitation that the search text should contain all keywords (and not only a subset).
What you are doing is basically
Search : search-set{1,2,3}
In :
sample-space-set{
set{1,2,3}
set{1,2,3,4}
set{2,3,4,5}
}
Result:
result-set{
set{1,2,3}
set{1,2,3,4}
}
Which could be put more concisely as
Find all the result-set from the sample-space-set where 'search-set is a subset'.
So basically, the name of the algorithm could be
'find all the mother-of-subsets'
(I really don't know what the reverse of the subset relationship is called. If you know, do let us all know.)
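In set terms, the reverse of "subset" is "superset", so the description above amounts to keeping every candidate whose word set is a superset of the search words. A minimal Python sketch of that reading follows; the function name is made up for the example, and word order and adjacency are deliberately ignored.

```python
def contains_all_words(query, candidates):
    """Keep the candidates whose word set is a superset of the query's word set."""
    wanted = set(query.split())
    return [c for c in candidates if wanted <= set(c.split())]


candidates = [
    "word1 word2",
    "word1 word2 word3",
    "word3 word2 word1",
    "word2 word3 word1",
    "word1 word3 word2 word4",
    "word1 word4 word3",
    "word4 word1 word2 word3",
]
for match in contains_all_words("word1 word2 word3", candidates):
    print(match)
```

This prints the five results listed in the question. Because only set membership is checked, a string such as "word1 word4 word2 word3" would also match, which is the 'keyword based search' behaviour mentioned in the other answer.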
