I have a compressed Freebase data dump that has all the entities in it. How can I use grep or something else to trim the data dump so it only contains English entities?
Here is what I am trying to get the RDF dump to look like: http://play.golang.org/p/-WwSysL3y3
<card>
<title></title>
<image></image>
<text></text>
<facts>
<fact></fact>
<fact></fact>
<fact></fact>
</facts>
</card>
Where card is one element per entity, with content in all of its child elements. Title is the /type/object/name. Image is the image URL for the topic's mid, built as "https://usercontent.googleapis.com/freebase/v1/image" followed by the mid. Text is the /common/document/text for the entity. And facts, with its fact children, holds the facts like age, birth date, and height that show up in the knowledge panels in search.
Here is my attempt to parse the RDF into XML like this in Go (Golang). I'd appreciate it if someone could help me get the RDF into this form.
Here is the algorithm or logic of what I am trying to do:
For every entity written in English:
parse the `/type/object/name` property and write its value to the XML file in the `<title></title>` element.
parse the mid, append it to `https://usercontent.googleapis.com/freebase/v1/image`, and then write the result to the XML file in the `<image></image>` element.
parse the `/common/document/text` property and write its value to the `<text></text>` element.
And lastly, for each fact about the entity, write it to a `<fact></fact>` element in the XML file; these are all children of the `<facts></facts>` element.
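Here is a minimal sketch (in Go, using encoding/xml) of just the output side, i.e. how the <card> shape could be modelled and written; the mid, title, text, and facts below are hypothetical placeholders, not values parsed from the dump:

package main

import (
	"encoding/xml"
	"fmt"
	"log"
	"os"
)

// Card mirrors the target <card> element; Facts maps to <facts><fact>...</fact></facts>.
type Card struct {
	XMLName xml.Name `xml:"card"`
	Title   string   `xml:"title"`
	Image   string   `xml:"image"`
	Text    string   `xml:"text"`
	Facts   []string `xml:"facts>fact"`
}

func main() {
	mid := "/m/0d6lp" // hypothetical mid
	card := Card{
		Title: "Example Topic", // /type/object/name
		Image: fmt.Sprintf("https://usercontent.googleapis.com/freebase/v1/image%s", mid),
		Text:  "Short description taken from /common/document/text.",
		Facts: []string{"birth-date: 1970-01-01", "height: 1.80 m"}, // hypothetical facts
	}

	enc := xml.NewEncoder(os.Stdout)
	enc.Indent("", "  ")
	if err := enc.Encode(card); err != nil {
		log.Fatal(err)
	}
}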
I agree with Joshua Taylor that the question is difficult to decipher, because entity is usually a synonym for Freebase object, which may have labels in multiple languages (or no labels/text at all).
If we recast the question as something along the lines of "How do I filter all non-English text from the compressed Freebase dump?," it becomes something that we can actually answer.
In RDF, all strings are labeled with their language, so if we see something like
ns:award.award_winner rdfs:label "Lauréat"@fr .
we can tell that Lauréat is the French name for the Freebase type called Award Winner in English.
To filter out non-English labels, use zgrep to exclude those lines which contain "@ but not "@en.
This will give you all the types, properties, numbers, and English labels/descriptions, but won't exclude those objects which don't have at least one English label (another possible interpretation of your question). To do that level of filtering, you'll probably need something more powerful than grep.
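If zgrep alone gets unwieldy, here is a minimal Go sketch of the same filter (the dump filename is an assumption); it streams the gzipped dump and keeps lines that either carry no language tag at all or carry an "@en tag:

package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("freebase-rdf-latest.gz") // assumed dump filename
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	zr, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer zr.Close()

	sc := bufio.NewScanner(zr)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // dump lines can be long

	for sc.Scan() {
		line := sc.Text()
		// Keep types, properties, numbers (no language tag) and English literals.
		if !strings.Contains(line, `"@`) || strings.Contains(line, `"@en`) {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}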
Someone intentionally made a site as unfriendly as possible, so I'm trying what I can to have our scraper still get where it needs to go.
<div class="issueDetails">
<div class="issueTitle ng-binding" style="">FANCY UNIQUE TEXT dd.MM.yyyy</div>
<a>COMPLETELY DIFFERENT TEXT</a>
I've left out the unnecessary details here, but I'm trying to find a match within the site through XPATH (can't use anything else for this) that will find something which fulfils both conditions, FANCY UNIQUE TEXT dd.MM.yyyy as well as COMPLETELY DIFFERENT TEXT.
I've tried my luck with //div[@class='issueDetails']/descendant::*[contains(text(), 'FANCY UNIQUE TEXT dd.MM.yyyy') and contains (text(), 'COMPLETELY DIFFERENT TEXT')]
but it relies on the erroneous assumption that both unique strings I need are in the same element.
The first, FANCY UNIQUE TEXT, is the unique identifier for where I want to go. The second, COMPLETELY DIFFERENT TEXT, is what I need the scraper to click on to actually head to that specific one. So an XPath that finds both despite them being different descendants is necessary.
Is this what you're looking for:
//div[@class="issueDetails"]/*[contains(.,"COMPLETELY DIFFERENT TEXT") or translate(substring(.,string-length(.)-9,10),"123456789","000000000")="00.00.0000" and contains(.,'FANCY UNIQUE TEXT')]
It will return the 2 elements matching your conditions: div and a.
The translate(), string-length() and substring() functions are used to check if a date pattern is present at the end of the div element.
EDIT: Check that the parent contains the texts you're looking for, then get the matching children with:
//div[@class='issueDetails'][contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") and contains(.,"COMPLETELY DIFFERENT TEXT")]/*[contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") or contains(.,"COMPLETELY DIFFERENT TEXT")]
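For illustration, a small Go harness (assuming the third-party github.com/antchfx/htmlquery package) that runs that last expression against the snippet from the question and prints the two matching children:

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/antchfx/htmlquery"
)

const page = `
<div class="issueDetails">
  <div class="issueTitle ng-binding">FANCY UNIQUE TEXT dd.MM.yyyy</div>
  <a>COMPLETELY DIFFERENT TEXT</a>
</div>`

func main() {
	doc, err := htmlquery.Parse(strings.NewReader(page))
	if err != nil {
		log.Fatal(err)
	}

	// Same XPath as the edited answer: match the issueDetails div that contains
	// both texts, then select whichever children contain either text.
	expr := `//div[@class='issueDetails'][contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") and contains(.,"COMPLETELY DIFFERENT TEXT")]/*[contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") or contains(.,"COMPLETELY DIFFERENT TEXT")]`
	nodes, err := htmlquery.QueryAll(doc, expr)
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range nodes {
		fmt.Println(n.Data, "->", htmlquery.InnerText(n))
	}
}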
I have a report card written in Word that uses an XML file for its input. In the XML file, if a student remains in the same section all three trimesters there will be one node for that class; if they change sections at the trimester they'll have one node for each section. The nodes look something like this (greatly simplified):
<ReportCardSectionFB Abs1="2" Abs2="11" CourseID="ELMATH1" CourseTitle="Math" PeriodStart="3" TeacherName="Jones, Jennifer" TermCode="Year" SectionID="ELMATH1-4" />
<ReportCardSectionFB Abs1="1.50" Abs2="6" CourseID="ELMATH1" CourseTitle="Math" PeriodStart="3" TeacherName="Smith, Tina" TermCode="Year" SectionID="ELMATH1-3" />
There is no indicator within the XML as to which trimester the node belongs to.
In the Word document, we're pulling the absence data with the following mail merge command:
{MERGEFIELD "ReportCardSectionFB[@PeriodStart='3']/@Abs1" \# 0.# \* MERGEFORMAT }
That's not working in this situation: it only gets the absence data from the first node it comes across, i.e.: 2.0. Is there a way to get the sum of @Abs1 for all period 3 classes, i.e.: 3.5? If not, is there a way to only get the last @Abs1 for period 3, i.e.: 1.5?
I recommend you use this 3rd party product, which can use XML as input and is capable of merging it with an MS Word template. It is also much more powerful than Word's built-in mail merge. You can see some examples here.
You could also try summing the absences in Synergy - there's a new checkbox under AttDef1, 2, etc. that adds up all the absences for the date range - "Include all day data for the entire date range regardless of section enrollment or section timeframe". That way the absences should be the same for each section, if that works for your district.
You can also try the SET field in Word to store the MERGEFIELDs in bookmarks and then use Word's formula functions to add the bookmarks together.
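A rough sketch of what that could look like, entered as nested fields with Ctrl+F9 (hedged: the indexed [1]/[2] predicates and the exact nesting are assumptions you would need to verify against your merge provider, and the second SET will be empty when only one section node exists):

{ SET Abs1a { MERGEFIELD "ReportCardSectionFB[@PeriodStart='3'][1]/@Abs1" } }
{ SET Abs1b { MERGEFIELD "ReportCardSectionFB[@PeriodStart='3'][2]/@Abs1" } }
{ = { REF Abs1a } + { REF Abs1b } \# 0.0 }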
My current issue is finding HTML tags inside property values. I thought it would be easy to search with a query like /jcr:root/content/xgermany//*[jcr:contains(., '<strong>')] order by @jcr:score
It looks like there is a problem with the characters < and >, because this query finds everything which has strong in its property. It finds <strong>Some Text</strong> but also This is a strong man.
The Query Builder API didn't help me either.
Is there a way to solve this with an XPath or SQL query, or do I have to iterate through the whole content?
I don't fully understand why it finds This is a strong man as a result for '<strong>', but it sounds like the unexpected behavior comes from the "simple search-engine syntax" for the second argument to jcr:contains(). Apparently the < > are just being ignored as "meaningless" punctuation.
You could try quoting the search term:
/jcr:root/content/xgermany//*[jcr:contains(., '"<strong>"')]
though you may have to tweak that if your whole XPath expression is enclosed in double quotes.
Of course this will not be very robust even if it works, since you're trying to find HTML elements by searching for fixed strings, instead of actually parsing the HTML.
If you have a specific jcr:primaryType and know the targeted properties, you can do something like this:
select * from nt:unstructured where text like '%<strong>%'
I tested it, but you need to know the properties you are interested in.
This is JCR-SQL syntax.
Start using predicates like a champ; this way all of this will make sense to you!
HTML entity encoded: &lt;strong&gt;
HTML decimal encoded: &#60;strong&#62;
Query builder is your friend:
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%<strong>%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%3Cstrong%3E%25
Predicates for the HTML-encoded form: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%&lt;strong&gt;%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%26lt%3Bstrong%26gt%3B%25
XPath:
/jcr:root/content/geometrixx//element(*, nt:unstructured)
[
jcr:like(@text, '%<strong>%')
]
SQL2 (already covered... NASTY YUK..)
SELECT * FROM [nt:unstructured] AS s WHERE ISDESCENDANTNODE([/content/geometrixx]) and text like '%<strong>%'
Although I'm sure it's entirely possible with a string of predicates, it's possibly heading down the wrong route. Ideally it would be better to parse the HTML when it is stored or published.
The required information would be stored on simple properties on the node in question. The query will then be a lot simpler with just a property = value query, than lots of overly complex query syntax.
It will probably be faster too.
So if you read in your HTML with something like HTMLClient and then parse it with an OSGi service, that service can accurately save these properties for you. Every time the HTML is changed, the process would update these properties as necessary. Just some thoughts if your SQL is getting too much.
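As a language-agnostic sketch of that idea (the real implementation would live in your OSGi service, not Go), you would parse the HTML once and persist a simple flag, so the query becomes a plain property comparison; the helper below uses golang.org/x/net/html:

package main

import (
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
)

// containsTag reports whether the parsed HTML fragment contains the given element.
func containsTag(fragment, tag string) (bool, error) {
	doc, err := html.Parse(strings.NewReader(fragment))
	if err != nil {
		return false, err
	}
	found := false
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == tag {
			found = true
			return
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return found, nil
}

func main() {
	ok, err := containsTag("<p>This is a <strong>strong</strong> man.</p>", "strong")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("hasStrong =", ok) // persist something like this as a property on the node
}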
SuperCollider has a String:parseYAML method that can create a nested Dictionary:
"{44: 'woo'}".parseYAML
Dictionary[ (44 -> woo) ]
But how to go the other way, output a YAML string given a (possibly nested) Dictionary?
Does the document have to be readable?
I've been using JSON.stringify from Felix's API quark in order to share dictionaries with a Max MSP application.
The result from this method is not readable, that is, it doesn't generate any newlines, tabs, etc. So it doesn't look pretty in a text document, but I imagine that's not the intention of the method's design.
I've been hacking away at this one for hours and I just can't figure it out. Using XPath to find text values is tricky and this problem has too many moving parts.
I have a webpage with a large table and a section in this table contains a list of users (assignees) that are assigned to a particular unit. There is nearly always multiple users assigned to a unit and I need to make sure a particular user is assigned to any of the units on the table. I've used XPath for nearly all of my selectors and I'm half way there on this one. I just can't seem to figure out how to use contains with text() in this context.
Here's what I have so far:
//td[@id='unit']/span [text()='asdfasdfasdfasdfasdf (Primary); asdfasdfasdfasdfasdf, asdfasdfasdfasdf; 456, 3456'; testuser]
The XPath Query above captures all text in the particular section I am looking at, which is great. However, I only need to know if testuser is in that section.
text() gets you a set of text nodes. I tend to use it more in a context of //span//text() or something.
If you are trying to check if the text inside an element contains something you should use contains on the element rather than the result of text() like this:
span[contains(., 'testuser')]
XPath is pretty good with context. If you know exactly what text a node should have you can do:
span[.='full text in this span']
But if you want to do something like regular expressions (using exslt for example) you'll need to use the string() function:
span[regexp:test(string(.), 'testuser')]
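Applied to the markup in the question, that would look something like this (assuming the td/span structure shown in your attempt above): //td[@id='unit']/span[contains(., 'testuser')]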