Ruby and PostgreSQL text matching

I am trying to do some text matching in Ruby using PostgreSQL.
This is my code:
require 'active_record'
require 'yaml'
require 'pg'
require 'pry'
$config = '
adapter: postgresql
database: edgar
username: YYYYY
password:
host: 127.0.0.1'
ActiveRecord::Base.establish_connection(YAML::load($config))
class Doc < ActiveRecord::Base; end
class Filing < ActiveRecord::Base; end
#Searching database edgar and table for key words
#install gem install pg_search
class Filing < ActiveRecord:: Base
include PgSearch
end
class Filing < ActiveRecord::Base
pg_search_score:search_eightks,
:against => [:cancer, heart attack]
end
I have a few questions:
How do I search the eightks table of my database?
How do I search for multiple words? I want to see if a document contains the word "cancer" OR "heart attack". It doesn't need both, just one or the other.
This is what the list of relations in the database looks like:
edgar=# \d
List of relations
 Schema |         Name          |   Type   | Owner
--------+-----------------------+----------+-------
 public | crsp_ccm_lookup       | table    | YYYY
 public | docs                  | table    | YYYY
 public | docs_downloaded       | table    | YYYY
 public | docs_id_seq           | sequence | YYYY
 public | document_types        | table    | YYYY
 public | document_types_id_seq | sequence | YYYY
 public | eightks               | table    | YYYY
 public | filings               | table    | YYYY
 public | filings_for_run       | table    | YYYY
 public | filings_id_seq1       | sequence | YYYY
 public | indices               | table    | YYYY
 public | indices_id_seq        | sequence | YYYY
 public | scraper_groups        | table    | YYYY
 public | scraper_groups_id_seq | sequence | YYYY
 public | ws                    | view     | YYYY
 public | ws_table              | table    | YYYY
 public | z_docs_10_ks          | table    | YYYY
(17 rows)
Ideally when a text document containing these words is found I would like to COPY it to a new folder.
Any help is greatly appreciated.

A couple of things regarding the code above specifically:
Not quite sure why you're opening the class twice in a row (and also a third time above that).
I believe pg_search_scope (note the p instead of the r) is the correct method to use (I don't see pg_search_score anywhere in the pg_search code).
So something like this:
class Filing < ActiveRecord::Base
  include PgSearch
  pg_search_scope :search_eightks,
                  :against => [:cancer, :heart_attack]
end
However, I am wondering if you really want to use full text search here, as that is a reasonably niche use case for Postgres.
For example, if the docs are just of type text, you could run a regex against them to check (or one of the string functions). Now, if they're large documents, or if you wanted to do some more advanced things like ranking the search results (in a somewhat Solr like fashion), then full-text may be worth a look.
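For instance, a rough sketch of the regex idea through Active Record (it assumes a Doc model mapped to the docs table with a text column named contents; those names are guesses rather than your actual schema):

# Case-insensitive regex match against a hypothetical contents column
matching_docs = Doc.where("contents ~* ?", "cancer|heart attack")
matching_docs.count # number of documents mentioning either term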
What is the overall picture of what you're trying to accomplish here? (Not how, but what/why?) It would also help to know the definitions of the tables involved and which version of Postgres you're running.
Also, I'm not quite sure what you mean by copying it to a new folder; "folder" doesn't mean anything in Postgres. Are you mixing Postgres terms with terms from your application logic?
Edit in response to comment from OP:
Based on your comments, there are about 200,000 documents (which sounds like a fairly small dataset, unless the documents themselves are enormous) and you want to check whether a few keywords appear in them. Assuming the documents are text (which it sounds like they are), I would recommend using a Postgres regex.
Assuming, for the sake of an example, that the document text is stored in a column called contents in the docs table, you could do something like this:
SELECT *
FROM docs
WHERE contents ~* 'heart attack|cancer|illness';
ActiveRecord lets you use raw SQL via the connection.execute method, where you would pass the above query as a string.
You can then process them further as desired.
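For example, a minimal sketch along those lines (it assumes the docs table and contents column from the query above, so adapt the names to your schema):

# Run the raw regex query through the Active Record connection
rows = ActiveRecord::Base.connection.execute(
  "SELECT * FROM docs WHERE contents ~* 'heart attack|cancer|illness'"
)

# With the pg adapter, each row comes back as a hash keyed by column name
rows.each { |row| puts row.inspect }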
I'm not saying you won't ever need to use full text search for this sort of thing, it's just that I wouldn't recommend starting with that, especially based on the information you've provided regarding your use case and data.
Edit in response to follow-up comment from OP:
The code you said you were using:
require
Select * from eightks where contents ~* 'heart attack | cancer |illness'
end
is not valid Ruby. You would need a begin to go with the end, instead of the require, and the SQL needs to be a string which is passed to your Active Record object's connection.execute method. There doesn't appear to be a connection object listed in your code above. Possibly, it would be an instance of the Filing class you're defining.
Also, the column contents will only work if you actually have that in your table definition. That was just for the example -- you'll need to adapt it to fit your specific table.
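As a hedged sketch of what a corrected version might look like (the contents column on eightks and the file_path column used for the copy are assumptions for illustration, not taken from your schema):

require 'fileutils'

matches = Filing.connection.execute(
  "SELECT * FROM eightks WHERE contents ~* 'heart attack|cancer|illness'"
)

# The "copy to a new folder" step happens in Ruby, not in Postgres:
# copy each matching document's file into a separate directory.
FileUtils.mkdir_p('matched_docs')
matches.each do |row|
  FileUtils.cp(row['file_path'], 'matched_docs/') if row['file_path']
end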
If you have additional issues getting the raw SQL to work in your scenario, it should be split off into a separate question, as this one is getting pretty long and veering into separate (although related) issues. You can link this question for additional context.

Related

jq replace substring in value

This is the output of my curl:
{
"expand": "renderedFields,names,schema,operations,editmeta,changelog,versionedRepresentations",
"id": "240937",
"self": "https://placeholder.atlassian.net/rest/api/latest/issue/240937",
"key": "placeholder-355",
"fields": {
"description": "We need the following page layout changes made (for all user profiles unless specified in front of the item):\n\n# Display {{TestAccount__c}} field on the account object underneath the *Pricing Group* field\n# Display {{blng__LegalEntity__c}} field that sits on {{Order Product}} to be visible on *Order* as well as *Contract* objects _(position doesn’t matter). Read only is fine as long as that works on SF reports if we need to bring this attribute in_\n# Display {{blng__LegalEntity__c}} field that sits on *Invoice Lines* to be visible on the *Invoice* object _(position doesn’t matter). Read only is fine as long as that works on SF reports if we need to bring this attribute in_\n# Display ‘*Notes*’ section on the credit request page - copy the one that exists on the contract object e.g. (sample below)\n!image-20220714-021135.png|width=50%!\n# -Display- *-Agent-* -field on contract object to enable users to see if account has an agent or not- → already in production (not needed)\n# Change default options for *tasks* - the default subjects are:\n#* Call\n#* Send Letter\n#* Send Quote\n#* Other\n# Need to update to the following:\n#* Call\n#* Video Call\n#* Face to Face Meeting\n# Add in *Contact hierarchy* functionality in Salesforce\n## Anyone can update the ‘Reports to’ field\n## Show the following fields on the hierarchy page"
}
}
With jq I'm extracting the description from the JSON:
jq --raw-output '.fields.description' jira-story.json
Result:
# Display *Pricing Group* field
# Display *Contract*
# Change default options for *tasks* - the default subjects are:
#* Call
#* Send Letter
#* Send Quote
#* Other
# Need to update to the following:
#* Call
#* Video Call
#* Face to Face Meeting
# Add in *Contact hierarchy*
## Anyone can update the ‘Reports to’ field
## Show the following fields on the hierarchy page
I want it nicely formatted as:
### Display *Pricing Group* field
### Display *Contract*
### Change default options for *tasks* - the default subjects are:
- Call
- Send Letter
- Send Quote
- Other
### Need to update to the following:
- Call
- Video Call
- Face to Face Meeting
### Add in *Contact hierarchy*
- Anyone can update the ‘Reports to’ field
- Show the following fields on the hierarchy page
How can I replace the values in jq before the output? Specifically: "#*" to "-", "##" to "-", and "#" to "###".
You don't necessarily need to perform the substitution in jq; you can pipe the output to sed:
jq -r '.fields.description' jira-story.json | sed 's/^#[#*]/-/;s/^#/###/'
If you really must do it with a single jq expression, then use what user jhnc wrote in the comments:
jq -r '.fields.description | gsub("\n#[#*]";"\n-") | gsub("\n# ","\n### ")'

Make step definition dynamic to handle any incoming text

I have to run the test1.feature file against two URLs. In one of the URLs I have a field named "EIN", but in the second URL the same field is named "ABN".
How can we make the step definition dynamic to handle whatever text arrives in the second string?
Url 1: https://testsite.us.com/student
The field is named "EIN"
Url 2: https://testsite.au.com/student
The field is named "ABN"
test1.feature
And I type "11 234 234 444" in "ABN" field
step_definition
// Type text value in textbox input field
Then('I type {string} in {string} field', (textval, textboxname) => {
  switch (textboxname) {
    case "Email":
    case "Work Phone":
    case "ABN":
    case "EIN":
    case "ACN":
      cy.get('#profile_container').parent().find('.fieldHeaderClass').contains(textboxname)
        .next().find('input')
        .clear({ force: true })
        .type(textval, { force: true });
      break;
    // final else condition
    default:
      cy.get('#profile_container').parent().find('.fieldHeaderClass').contains(textboxname)
        .next()
        .type(textval, { force: true });
  }
});
First of all, a great tip: make use of page objects (the POM design pattern). A page object is an object (inside a third .js file besides your features and step definitions) that you import into the step definitions file and that contains all of your selector code (cy.get(...)). You don't want selector code in your step definitions; it makes them messy and harder to read.
Concerning your problem: if you want to repeat the same logic for (all kinds of) input values, why not write your logic just once (without the long case statement) and use a Scenario Outline to repeat your step for the different inputs? If your logic must be different every time, then don't even bother fixing this problem; you should simply write different steps for different logic.
Here is an example of scenario outline (inside a .feature):
Scenario Outline: Scenario Outline name
  Given I type "<textval>" in "<textboxname>" field

  Examples:
    | textval            | textboxname |
    | some_value         | ABN         |
    | some_other_value   | EIN         |
    | some_other_value_2 | email       |
    | some_other_value_3 | ACN         |
    | some_other_value_4 | Work Phone  |
Conclusion: use a Scenario Outline to repeat logic. If the logic needs to be different for different inputs, then write another step definition and don't try to cram different logic into one step.

Count and print images from URL

This is my first time using Spark/Scala and I am lost.
I am supposed to write a program that takes in a URL and outputs the number of images and the names of the image files.
So I was able to get the image count. I am doing this all in the command prompt, which makes it quite difficult to go back and edit my def without retyping the whole thing. Is there a better alternative? It took me quite a while just to get Spark/Scala working (I would have liked to use PySpark but was unable to get them to communicate).
scala> def URLcount(url : String) : String = {
| var html = scala.io.Source.fromURL(url).mkString
| var list = html.split("\n").filter(_ != "")
| val rdds = sc.parallelize(list)
| val count = rdds.filter(_.contains("img")).count()
| return("There are " + count + " images at the " + url + " site.")
| }
URLcount: (url: String)String
scala> URLcount("https://www.yahoo.com/")
res14: String = There are 9 images at the https://www.yahoo.com/ site.
So I'm assuming that after I parallelize the list I should be able to apply a filter and create a list of all the strings that contain "img src".
How would I create such a list and then print it line by line to display the image URLs?
I'm not sure Spark is a great solution for parsing HTML. I think Spark was created for big data (although it is general purpose), and I did not find any easy way to parse HTML with Spark (whereas it was easy to find for both XML and JSON). That also means in this case you may print very long strings, because HTML pages are often compressed onto a few lines. Anyway, for this page your program would print lines like this:
<p>So I'm assuming after I parallelize the list I should be about to apply a filter and create a list of all the strings that contain "img src"
I can advise you to use Jsoup:
import org.jsoup.Jsoup

val yahoo = Jsoup.connect("https://www.yahoo.com").get
val images = yahoo.select("img[src]")
images.forEach(println)
You can use Spark for other purposes.
PS: I found 39 image tags with a src attribute on https://www.yahoo.com. It is very easy to get errors if you don't use a good HTML parser.
Another way: prepare your data first and then use Spark.
Sorry for my English.

Sitecore item multilistfield XPATH builder

I'm trying to count, with the XPath Builder in Sitecore, the number of items which have more than 5 values in a multilist field.
I cannot count the number of "|" separators in the raw values, so I have to say I am stuck.
Any info will be helpful.
Thank you.
It's been a long time since I used XPath in Sitecore - so I may have forgotten something important - but:
Sadly, I don't think this is possible. XPath Builder doesn't really run proper XPath. It understands a subset of things that would evaluate correctly in a full XPath parser.
One of the things it can't do (on the v8-initial-release instance I have to hand) is process XPath that returns things that are not Sitecore Items. A query like count(/sitecore/content/*) should return a number, but if you try to run that using either the Sitecore Query syntax or the XPath syntax option, you get an error.
If you could run such a query, then your answer would be based on an expression like this, to perform the count of GUIDs referenced by a specific field:
string-length( translate(/yourNodePath/@yourFieldName, "ABCDEFabcdef0123456789{}-", "") ) + 1
(Typed from memory, as I can't run a test - so may not be entirely correct)
The translate() function replaces each character of the second argument found in the first argument with the corresponding character of the third argument, removing it when there is no corresponding character. Hence (if I've typed it correctly) that expression should remove all the GUID characters and leave just the pipe-separator characters. One plus the length of the remaining string is then your answer for each Item you need to process.
But, as I say, I don't think you can actually run that from Query Builder...
These days, people tend to use Sitecore PowerShell Extensions to write ad-hoc queries like this. It's much more flexible and powerful - so if you can use that, I'd recommend it.
Edited to add: This question got a bit stuck in my head - so if you are able to use PowerShell, here's how you might do it:
Assuming you have declared where you're searching, what MultiList field you're querying, and what number of selections Items must exceed:
$root = "/sitecore/content/Root"
$field = "MultiListField"
$targetNumber = 3
then the "easy to read" code might look like this:
foreach($item in Get-ChildItem $root)
{
    $currentField = Get-ItemField $item -ReturnType Field -Name $field
    if($currentField)
    {
        $count = $currentField.Value.Split('|').Count
        if($count -gt $targetNumber)
        {
            $item.Paths.Path
        }
    }
}
It iterates the children of the root item you specified and gets the contents of your field. If that field has a value, it splits it into GUIDs and counts them. If the result of that count is greater than your threshold, it returns the item's path.
You can get the same answer out of a (harder to read) one-liner, which would look something like:
Get-ChildItem $root | Select-Object Paths, @{ Name="FieldCount"; Expression={ Get-ItemField $_ -ReturnType Field -Name $field | % { $_.Value.Split('|').Count } } } | Where-Object { $_.FieldCount -gt $targetNumber } | % { $_.Paths.Path }
(Not sure if that's the best way to write that - I'm no expert at PowerShell syntax - but it gives the same results as far as I can see)

How can I / Is it possible to warn the user about unused variables within a logic rule in a Prolog-like DSL developed through Xtext?

I'm new here but I hope someone can help me.
I'm developing a Prolog-like DSL for a university project.
This is a simplified grammar that I use to experiment with things:
grammar it.unibo.gciatto.Garbage hidden (SL_COMMENT, ML_COMMENT, WS, ANY_OTHER)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate garbage "http://www.unibo.it/gciatto/Garbage"
PTheory returns Theory
: (kb+=PExpression '.')*
;
PExpression returns Expression
: PRule
;
PRule returns Expression
: PConjunction ({ Expression.left=current } name=':-' right=PConjunction)?
;
PConjunction returns Expression
: PExpression0 ({ Expression.left=current } name=',' right=PConjunction)?
;
PExpression0 returns Expression
: PTerm
| '(' PExpression ')'
;
PTerm returns Term
: PStruct
| PVariable
| PNumber
;
PVariable returns Variable
: { AnonymousVariable } name='_'
| name=VARIABLE
;
PNumber returns Number
: value=INT
;
PStruct returns Struct
: name=ATOM '(' arg+=PExpression0 (',' arg+=PExpression0)* ')'
| PAtom
;
PAtom returns Atom
: name=ATOM
| { AtomString } name=STRING
;
terminal fragment CHARSEQ : ('a'..'z' | 'A' .. 'Z' | '0'..'9' | '_')*;
terminal ATOM : ('a'..'z') CHARSEQ;
terminal VARIABLE : ('A'..'Z') CHARSEQ;
terminal INT returns ecore::EInt: ('0'..'9')+;
terminal STRING :
'"' ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|'"') )* '"' |
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'"
;
terminal ML_COMMENT : '/*' -> '*/';
terminal SL_COMMENT : '//' !('\n'|'\r')* ('\r'? '\n')?;
terminal WS : (' '|'\t'|'\r'|'\n')+;
terminal ANY_OTHER: .;
When validating, I'd love to search for unused variables in rule definitions and suggest that the user use the anonymous variable instead. Once I've understood the mechanism, I may consider similar validation rules.
I know Xtext has a built-in scoping mechanism and I've been able to use it in different situations, but as you know, any IScopeProvider provides a scope for a given EReference (am I right?) and, as you can see, my grammar has no cross-references. The reason for that is simple: in Prolog a variable "definition" and its "references" are syntactically the same, so no context-free parser able to distinguish the two contexts can be generated (I'm pretty sure, even without a formal proof).
However, I think the validation algorithm is quite simple:
"while navigating the AST, collect any variable within an ad-hoc data structure and the count occurrences" or something smarter than that
Now the real question is: can I somehow (re)use any part of the Xtext scoping framework, and if so, how? Or should I build a simple scoping library myself?
Sorry for the long question and bad English; I hope I was thorough.
Thank you for reading.
The Xtext validation framework can easily reuse your scope provider instance, and you can write validation rules with it. A sample validator is already generated for the Xtext grammar; you have to extend it with your specific validation case as follows:
public class GarbageLanguageJavaValidator extends AbstractGarbageLanguageJavaValidator {

    @Inject
    GarbageLanguageScopeProvider scopeProvider;

    // Validation rule for theories. For any other element, change the input parameter
    @Check
    public void checkTheory(Theory theory) {
        // here, you can simply reuse the injected scope provider
        scopeProvider.getAllReferencesInTheory();
        // in case of problems, report errors using the inherited error/warning methods
    }
}
The created validation rules are automatically registered and executed (see also the Xtext documentation for details about validation).
I actually solved the problem a few days after I posted the question and then I was too busy to post the solution. Here it comes (for a more detailed description of both the problem and the solution, you are free to read the essay I wrote: RespectX - section 4.5 ).
I created my own IQualifiedNameProvider, called IPrologSimpleNameProvider, which simply returns the 'name' feature.
Then I created my own IContextualScopeProvider, which doesn't extend IScopeProvider. It exposes the following methods:
getContext: given any AST node, it returns the root of the current context, i.e. the EObject whose eContainer is an instance of Theory and which contains the input node within its sub-tree.
getScope: returns an IScope for the context of the input node.
getFilteredScope: applies a type filter to a getScope invocation (e.g. it makes it easy to create a scope containing only Variables).
getFilteredScope: filters a getScope invocation using a predicate.
Of course, the IContextualScopeProvider implementation uses an IPrologSimpleNameProvider implementation, so the validation rule is now quite simple to implement:
Given a variable, it uses the getScope method, which returns an IScope containing all the variables within that context.
It counts how many variables within the IScope share the current one's name.
If there are fewer than 2, a warning is raised.
I really hope I explained ^^"
