Google Natural Language Syntax Analysis

I'm not sure this is a coding issue/question.
I'm using Google's NLP to analyze the syntax of some sentences and I'm seeing some inconsistencies with Plural vs Singular designation. Perhaps I'm doing something wrong or misunderstanding what I see as an inconsistency.
For example:
The dolphins jump over the wall
The word dolphins is labeled "SINGULAR" where I was expecting "PLURAL". I thought maybe that's because it's referring to the group as ONE "school of fish" (although they are mammals).
So I tried Crows
The crows jump over the wall
The crows are jumping over the wall
Both of these return crows as "SINGULAR", which I thought was at least consistent, since a group of crows is ONE "murder of crows".
OK, fine. Then I tried cows; a group of cows is ONE herd:
The cows jump over the wall
But in this sentence, the word cows is labeled "PLURAL".
I'm no linguistics expert, so that may be a cause of my confusion.
Or is this "inconsistency" due to analyzing the sentence ONLY using the analyzeSyntax API without analyzing its sentiment or the entities?
This is the log for The cows jump over the wall.
{ theSentence: 'The cows jump over the wall.',
  theTags: [ 'DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN', 'PUNCT' ],
  theLabels: [ 'DET', 'NSUBJ', 'ROOT', 'PREP', 'DET', 'POBJ', 'P' ],
  theNumbers:
   [ 'NUMBER_UNKNOWN',
     'PLURAL',
     'SINGULAR',
     'NUMBER_UNKNOWN',
     'NUMBER_UNKNOWN',
     'SINGULAR',
     'NUMBER_UNKNOWN' ] }
This is the log for The crows jump over the wall.
{ theSentence: 'The crows jump over the wall.',
  theTags: [ 'DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN', 'PUNCT' ],
  theLabels: [ 'DET', 'NSUBJ', 'ROOT', 'PREP', 'DET', 'POBJ', 'P' ],
  theNumbers:
   [ 'NUMBER_UNKNOWN',
     'SINGULAR',
     'SINGULAR',
     'NUMBER_UNKNOWN',
     'NUMBER_UNKNOWN',
     'SINGULAR',
     'NUMBER_UNKNOWN' ] }
Update: I tried using https://language.googleapis.com/v1beta2/documents:analyzeSyntax and I get the same results.
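For reference, here is a minimal sketch of the kind of request that produces the logs above, calling the same v1beta2 REST endpoint directly (illustrative only; it assumes an API key exported as GOOGLE_API_KEY):

require 'net/http'
require 'json'

# Call analyzeSyntax and report each token's "number" attribute
# (SINGULAR / PLURAL / NUMBER_UNKNOWN).
def number_labels(sentence)
  uri = URI("https://language.googleapis.com/v1beta2/documents:analyzeSyntax" \
            "?key=#{ENV['GOOGLE_API_KEY']}")
  body = { document: { type: 'PLAIN_TEXT', content: sentence },
           encodingType: 'UTF8' }
  res = Net::HTTP.post(uri, body.to_json, 'Content-Type' => 'application/json')
  JSON.parse(res.body)['tokens']
      .map { |t| [t['text']['content'], t['partOfSpeech']['number']] }
end

p number_labels('The cows jump over the wall.')
p number_labels('The crows jump over the wall.')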

Related

Calculating with the values of links

I need to calculate a connection value between myself and all of my nearPersons, which uses, among other things, the trust value of the link. However, getting the trust values of all links and using them in the computation slows down the running time, and I have not managed to find a more efficient way to do it. Any suggestions would be highly appreciated!
breed [persons person]
undirected-link-breed [connections connection]
connections-own [trust]
persons-own
[
  nearPersons
  familiarity
]

to setup
  clear-all
  setupPersons
  setupConnections
  updateConnections
  reset-ticks
end

setupPersons
  create-persons 1000
  [
    set color black
    set grouped false
    setxy random-xcor random-ycor
  ]
end

to setupConnections
  ask persons [create-connections-with other persons]
  ask connections [ set trust 0.4 ]
end

to updateConnections
  while [(count persons with [grouped = false] / 1000) > 5]
  [
    let highlyTrusted n-of (2 + random 9) (persons with [grouped = false])
    ask highlyTrusted
    [
      ask my-out-connections with [member? other-end highlyTrusted] [set trust 0.6]
      set grouped true
    ]
  ]
end

to go
  getNearPersons
  calculateConnection
  forward 1
end

to getNearPersons
  ask persons [ set nearPersons other persons in-cone 3 360 ]
end

to calculateConnection
  ask persons with [nearPersons != nobody]
  [
    ask nearPersons
    [
      let degreeOfTrust [trust] of in-connection-from myself
    ]
  ]
end
Here is the model using agentsets of trust rather than links. I think it will give you identical results, as long as there are only those two trust levels. I made a few other changes to the code, in particular because your original while statement would never run as written (the quotient was always 1, so it could never exceed 5). Most of the rest are simply matters of style. Let me know if this is not what you need. It's still not a speed demon, about 5 seconds per tick with 1000 persons, but a lot faster than the link version. Note that nPersons is a global so that I could play around with it.
Charles
globals [nPersons]

breed [persons person]
persons-own
[
  nearPersons
  Trusted
  unTrusted
  familiarity
  grouped
]

to setup
  clear-all
  set nPersons 1000
  ask patches [ set pcolor gray ]
  setupPersons
  updateConnections
  reset-ticks
end

to setupPersons
  create-persons nPersons
  [
    set color black
    set grouped false
    setxy random-xcor random-ycor
  ]
  ask persons [
    set Trusted no-turtles
    set unTrusted persons
    set familiarity 1
  ]
end

to updateConnections
  while [(count persons with [grouped = false] / nPersons) > 0.2]
  [
    let highlyTrusted n-of (2 + random 9) (persons)
    ; so persons can be in more than one trust group and treat all trust groups equally?
    ask highlyTrusted
    [
      set Trusted other highlyTrusted
      set grouped true
    ]
  ]
end

to go
  ask persons [ getNearPersons ]
  calculateConnection
  ask persons [ forward 1 ]
  tick
end

to getNearPersons
  ask persons [ set nearPersons other persons in-radius 3 ]
end

to calculateConnection
  ask persons with [any? nearPersons]
  [
    let affiliation []
    ask nearPersons
    [
      let degreeOfTrust ifelse-value (member? myself Trusted) [0.6] [0.4]
      ; let degreeOfTrust [trust] of in-connection-from myself  ;; this line causes NetLogo to run very slowly
      set affiliation lput (degreeOfTrust * familiarity) affiliation
    ]
  ]
end
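The speedup comes from replacing a search over roughly half a million link objects with a membership test against a small stored set. The same idea outside NetLogo, as an illustrative Ruby sketch (all names invented):

require 'set'

N_PERSONS = 1000
persons = (0...N_PERSONS).to_a
trusted = Hash.new { |h, k| h[k] = Set.new }  # who each person highly trusts

# form small "trust groups", analogous to updateConnections above
200.times do
  group = persons.sample(2 + rand(9))
  group.each { |p| trusted[p].merge(group - [p]) }
end

# constant-time membership test instead of polling one of ~n*(n-1)/2 links
def degree_of_trust(trusted, p, q)
  trusted[p].include?(q) ? 0.6 : 0.4
end

puts degree_of_trust(trusted, persons[0], persons[1])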
I had to make a few modifications to your code to make it run; I've included it below. But I believe the real problem is simply the scale of what you are doing. With 1000 persons there are n*(n-1)/2 = 499,500 links that you are repeatedly polling, and given the size of your world, the number of nearPersons for each person is likely quite large. Do you really need 1000 persons in your model, and/or do you really need every person to be connected to every other person? The number of links grows quadratically with the number of persons.
globals [nPersons]  ; added: declare nPersons so this compiles; set in setup (see the note about nPersons above)

breed [persons person]
undirected-link-breed [connections connection]
connections-own [trust]
persons-own
[
  nearPersons
  familiarity
  grouped
]

to setup
  clear-all
  set nPersons 1000
  setupPersons
  show "persons setup"
  setupConnections
  show "connections setup"
  updateConnections
  show "connections updated"
  reset-ticks
end

to setupPersons
  create-persons nPersons
  [
    set color black
    set grouped false
    setxy random-xcor random-ycor
  ]
end

to setupConnections
  ask persons [create-connections-with other persons]
  ask connections [ set trust 0.4 ]
end

to updateConnections
  while [(count persons with [grouped = false] / 1000) > 5]
  [
    let highlyTrusted n-of (2 + random 9) (persons)
    ask highlyTrusted
    [
      ask my-out-connections with [member? other-end highlyTrusted] [set trust 0.6]
      set grouped true
    ]
  ]
end

to go
  ask persons [ getNearPersons ]
  show mean [count nearPersons] of persons
  ask persons [ calculateConnection ]
  show "got Connections"
  ask persons [ forward 1 ]
  tick
end

to getNearPersons
  ask persons [ set nearPersons other persons in-cone 3 360 ]
end

to calculateConnection
  ask persons with [nearPersons != nobody]
  [
    let affiliation []
    let idx 0
    ask nearPersons
    [
      let degreeOfTrust [trust] of in-connection-from myself  ;; this line causes NetLogo to run very slowly
      set affiliation insert-item idx affiliation (degreeOfTrust * familiarity)
      set idx idx + 1
    ]
  ]
end

RGeo - fix self-intersections

I have a bunch of polygons that have self-intersection which causes some errors in further postprocessing them (in particular - I can't calculate intersection area of those polygons with other polygons). Here is an example of broken polygon:
{
  "type": "MultiPolygon",
  "coordinates": [
    [
      [
        [6.881057785381658, 46.82373306675715],
        [6.857171686909481, 46.81861230543794],
        [6.857354659059071, 46.81856788926046],
        [6.856993473052509, 46.82693029065604],
        [6.8612894138116785, 46.83422796373707],
        [6.86720955648855, 46.835636765630476],
        [6.871281147359957, 46.83078486366309],
        [6.871573291317274, 46.8306215963777],
        [6.877608228639841, 46.82771553607934],
        [6.877758462659651, 46.82772313420989],
        [6.877852632482749, 46.827735617670285],
        [6.880928107931434, 46.82630213148064],
        [6.8810399979122305, 46.82622029042867],
        [6.881117606743071, 46.826115612819855],
        [6.881057785381658, 46.82373306675715]
      ]
    ]
  ]
}
This is what it looks like on the map - as you can see, two polygon edges intersect. RGeo throws an error pointing at the intersection coordinate (I guess): => "Geos::GEOSException: TopologyException: Input geom 0 is invalid: Self-intersection at or near point 6.8573510795579145 46.818650764080992 at 6.8573510795579145 46.818650764080992". So I have that, at least.
My question is: is there a way to fix that intersection automatically? I read that a possible solution is to insert 2 similar points with the coordinates of the self-intersection. But the problem is that the polygon has a specific order, and I don't know WHERE to insert those points.
Also, maybe there are some existing tools helping fix that...
The solution I would use is PostGIS's ST_MakeValid, if Postgres is an option for you. You could do something along the lines of ST_AsText(ST_MakeValid(geom_column)), or if you would rather pass in the text, here is an example using the bowtie polygon shown in prepair:
select ST_AsText(ST_MakeValid(ST_GeomFromText('POLYGON((0 0, 0 10, 10 0, 10 10, 0 0))')));
st_astext
-----------------------------------------------------------
MULTIPOLYGON(((0 0,0 10,5 5,0 0)),((5 5,10 10,10 0,5 5)))
(1 row)
If that doesn't interest you, you could export those geometries and use a tool like prepair to convert them. To sum up how this works behind the scenes, it splits these "bowties" into multiple polygons, which are then made into a multipolygon. The same type of fix will be applied to multipolygons.
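If you would rather stay in Ruby, the classic GEOS trick of buffering by zero often repairs simple bowtie self-intersections as well. A sketch using the rgeo and rgeo-geojson gems (the file name is hypothetical, and depending on your RGeo version you may need a lenient factory so the invalid geometry can be loaded at all):

require 'rgeo'
require 'rgeo/geo_json'

# buffer(0) re-nodes the ring, splitting a bowtie into valid pieces,
# much like ST_MakeValid (though it can drop degenerate parts).
factory = RGeo::Geos.factory(uses_lenient_assertions: true)
geom = RGeo::GeoJSON.decode(File.read('broken.json'), geo_factory: factory)

fixed = geom.buffer(0)
puts fixed.as_text

Newer RGeo releases built against GEOS 3.8+ may also expose a make_valid method on geometries, which is worth checking for before falling back on buffer(0).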

Generic restructure of ruby hash (with depth of 4)

Is it possible to convert this HASH into an array of arrays based solely on the position of the key (rather than its value)? I.e., I know ahead of time that the first key will always be PROD/ALPHA, and the second key will always be a country (which I would like to be able to change at will in the future).
The idea would be to group all servers of the same type (webservers) that are also in the same environment (production) but are located in different farms (UK, USA)
While any suggestions on how to do this are welcome, I'll be happy to just know that I'm not walking into a dead-end I won't be able to solve.
Here are some visuals to aid in my explanation:
{
  "PROD": {
    "USA": {
      "generic": ["nginx-240"],
      "WEB": ["nginx-210", "nginx-241", "nginx-211", "nginx-209"],
      "APP": ["tomcat-269", "tomcat-255", "tomcat-119", "tomcat-124"]
    },
    "UK": {
      "WEB": ["nginx-249", "nginx-250", "nginx-246", "nginx-247", "nginx-248"],
      "generic": ["tomcat-302"],
      "APP": ["tomcat-396", "tomcat-156", "tomcat-157"]
    }
  },
  "ALPHA": {
    "USA": {
      "WEB": ["nginx-144", "nginx-146", "nginx-145", "nginx-175", "nginx-173"],
      "APP": ["tomcat-204", "tomcat-206"]
    }
  }
}
The expectation is that data from the lowest level in the hash would be grouped together.
Again the idea is that all Production app servers (both from UK and USA) are grouped together in the following kind of pattern:
PROD_UK_APP would be represented by ["tomcat-396","tomcat-156","tomcat-157"], as these are the lowest branches of the tree PROD->UK->applicationserver:
[
  [
    [PROD_UK_APP], [PROD_USA_APP]
  ],
  [
    [PROD_UK_WEB], [PROD_USA_WEB]
  ]
]
New list:
[
  [
    [ALPHA_USA_WEB]
  ],
  [
    [ALPHA_USA_APP]
  ]
]
Again, the idea is to keep this generic. Is this something that is practically achievable, or am I likely to require some degree of hardcoding to ensure it always works? The idea is that if tomorrow UK becomes JAPAN, it will still work in exactly the same way, comparing the APP and WEB tiers of UK and JAPAN (separating ALPHA from PROD).
EDIT: my attempt at sorting it:
def walk
  a = []
  myhash.each do |env, data|
    data.each do |dc, tier|
      tier.each do |x, y|
        a << y
      end
    end
  end
  p a
end
[["nginx240"], ["nginx210", "nginx241", "nginx211", "nginx209"], ["tomcat269", "tomcat255", "tomcat119", "tomcat124"], ["nginx249", "nginx250", "nginx246", "nginx247", "nginx248"], ["tomcat302"], ["tomcat396", "tomcat156", "tomcat157"], ["nginx144", "nginx146", "nginx145", "nginx175", "nginx173"], ["tomcat204", "tomcat206"]]
Thanks,
I think I follow what you're looking for, and you should get what you're after with:
myhash.values.each_with_object([]) do |by_country, out_arr|
  by_country.values.each do |by_type|
    out_arr << by_type.values
  end
end
which would return:
[
  [
    ["nginx-240"],
    ["nginx-210", "nginx-241", "nginx-211", "nginx-209"],
    ["tomcat-269", "tomcat-255", "tomcat-119", "tomcat-124"]
  ],
  [
    ["nginx-249", "nginx-250", "nginx-246", "nginx-247", "nginx-248"],
    ["tomcat-302"],
    ["tomcat-396", "tomcat-156", "tomcat-157"]
  ],
  [
    ["nginx-144", "nginx-146", "nginx-145", "nginx-175", "nginx-173"],
    ["tomcat-204", "tomcat-206"]
  ]
]
Piece by piece:
Take your hash, discard the keys, and just create an array of the values.
Iterate over those values (arrays of hashes keyed by country), initializing an array to return.
For each hash that by_country points to, again take the values to drop down into the by_type hashes.
Iterate over your by_type hashes and again take the values of each.
Push each array of values into the array you want to return.
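If the depth of the hash ever changes, the same discard-the-keys idea can be made depth-agnostic. A small illustrative sketch (not part of the answer above): recurse through the values, keeping one level of grouping per level of hash.

# Walks the hash, discarding keys; arrays (the leaves) are returned as-is.
def leaf_groups(node)
  return node unless node.is_a?(Hash)
  node.values.map { |child| leaf_groups(child) }
end

p leaf_groups(myhash)
# => the same leaf arrays, nested one level deeper (env -> country -> type)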

Identify text-based format

Is the data below in a well-known format, or is this a custom format invented by the generator?
[{
  "tmsId": "MV006574730000",
  "rootId": "11214341",
  "subType": "Feature Film",
  "title": "Doctor Strange 3D",
  "releaseYear": 2016,
  "releaseDate": "2016-11-04",
  "titleLang": "en",
  "descriptionLang": "en",
  "entityType": "Movie",
  "genres": ["Action", "Adventure", "Fantasy"],
  "longDescription": "Dr. Stephen Strange's (Benedict Cumberbatch) life changes after a car accident robs him of the use of his hands. When traditional medicine fails him, he looks for healing, and hope, in a mysterious enclave. He quickly learns that the enclave is at the front line of a battle against unseen dark forces bent on destroying reality. Before long, Strange is forced to choose between his life of fortune and status or leave it all behind to defend the world as the most powerful sorcerer in existence.",
  "shortDescription": "Dr. Stephen Strange discovers the world of magic after meeting the Ancient One.",
  "topCast": ["Benedict Cumberbatch", "Chiwetel Ejiofor", "Rachel McAdams"],
  "directors": ["Scott Derrickson"],
  "officialUrl": "http://marvel.com/doctorstrange",
  "ratings": [{
    "body": "Motion Picture Association of America",
    "code": "PG-13"
  }],
Well, this is indeed JSON format. I suppose the chunk of data you are giving us here is not the complete data, because some closing brackets are missing. If you delete the last comma "," and append "}]", then it passes validation in jsonlint.
You can try it here: jsonlint.com
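You can also check the repair programmatically rather than pasting into a web validator. For example, in Ruby (the file name is hypothetical):

require 'json'

# Strip the trailing comma, close the object and the outer array,
# then parse to confirm the result is valid JSON.
fragment = File.read('data.json').sub(/,\s*\z/, '') + '}]'

begin
  data = JSON.parse(fragment)
  puts "valid JSON: #{data.first['title']}"
rescue JSON::ParserError => e
  puts "still broken: #{e.message}"
end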

Highlight part of code block

I have a very large code block in my .rst file, which I would like to highlight just a small portion of and make it bold. Consider the following rst:
wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.
wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.
**Example 1: Explain showing a table scan operation**::

    EXPLAIN FORMAT=JSON
    SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
    {
      "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "53.80"          # This query costs 53.80 cost units
        },
        "table": {
          "table_name": "Country",
          "access_type": "ALL",          # ALL is a table scan
          "rows_examined_per_scan": 239, # Accessing all 239 rows in the table
          "rows_produced_per_join": 11,
          "filtered": "4.76",
          "cost_info": {
            "read_cost": "51.52",
            "eval_cost": "2.28",
            "prefix_cost": "53.80",
            "data_read_per_join": "2K"
          },
          "used_columns": [
            "Code",
            "Name",
            "Continent",
            "Region",
            "SurfaceArea",
            "IndepYear",
            "Population",
            "LifeExpectancy",
            "GNP",
            "GNPOld",
            "LocalName",
            "GovernmentForm",
            "HeadOfState",
            "Capital",
            "Code2"
          ],
          "attached_condition": "((`world`.`Country`.`Continent` = 'Asia') and (`world`.`Country`.`Population` > 5000000))"
        }
      }
    }
When it converts to HTML, it syntax-highlights by default (good), but I also want to specify a few lines that should be bold (the ones with comments on them, but possibly others too).
I was thinking of adding a trailing character sequence to each such line (e.g. ###) and then writing a post-parser script to modify the generated HTML files. Is there a better way?
The code-block directive has an emphasize-lines option (note that code-block with emphasize-lines is a Sphinx directive, so it needs Sphinx rather than plain docutils). The following highlights the lines with comments in your code:
**Example 1: Explain showing a table scan operation**

.. code-block:: python
   :emphasize-lines: 7, 11, 12

   EXPLAIN FORMAT=JSON
   SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
   {
     "query_block": {
       "select_id": 1,
       "cost_info": {
         "query_cost": "53.80"          # This query costs 53.80 cost units
       },
       "table": {
         "table_name": "Country",
         "access_type": "ALL",          # ALL is a table scan
         "rows_examined_per_scan": 239, # Accessing all 239 rows in the table
         "rows_produced_per_join": 11,
         "filtered": "4.76",
         "cost_info": {
           "read_cost": "51.52",
           "eval_cost": "2.28",
           "prefix_cost": "53.80",
           "data_read_per_join": "2K"
         },
         "used_columns": [
           "Code",
           "Name",
           "Continent",
           "Region",
           "SurfaceArea",
           "IndepYear",
           "Population",
           "LifeExpectancy",
           "GNP",
           "GNPOld",
           "LocalName",
           "GovernmentForm",
           "HeadOfState",
           "Capital",
           "Code2"
         ],
         "attached_condition": "((`world`.`Country`.`Continent` = 'Asia') and (`world`.`Country`.`Population` > 5000000))"
       }
     }
   }
