Hyphen encoding (minus) in Google Base RSS feed - utf-8

I am trying to automatically generate a feed of data to be sent to Google Base using UTF-8 encoding.
However, I am getting errors whenever hyphens are found, telling me that there is an encoding error in the relevant attribute (title, description, product_type). I am currently using:
−
but I have also tried:
−
neither of which has worked.
I am using the following declaration at the top of the document:
<?xml version="1.0" encoding="utf-8"?>
OK, to give further context: the data is being pulled from our site's product information, stored as UTF-8 encoded data in a MySQL database. The data goes into an RSS 2.0 feed, using some standard RSS attributes as well as some custom-defined Google attributes. The problem comes up whenever there is a hyphen in any field except the link field, so it appears in the title and description fields as well as the custom product_type field. Below is an example of a field that Google Base (Merchant Centre) throws an error over. It throws the same error with or without the other entities and only stops objecting when hyphens are removed.
<description><p>Your sports floor is designed primarily for sports use. Though many facilities have to be used for other activities, including things like assemblies, careers fairs, drama, parties and social events, bring and buy sales, exhibitions, etc.</p>
<p>Solid hardwood sports floors are designated as "area elastic floors" to provide the spring resilience and shock absorbing qualities needed for sports and dance use to minimise injury. If the floor is too hard the athlete and user will be exposed to early fatigue and aching joints through to injury such as sprains joint and shin bone damage.</p>
<p>If too soft then ball bounce and running characteristics are compromised.
In the UK hardwood sports floors are governed by a number of recognised standards</p>
<p>All sports floors must conform to BS7044 Part 4 - this is the minimum Sport England requirement with which your floor must comply if it is part of a Sport England sponsored project.</p>
<p>A higher more demanding standard for better quality sports and dance flooring is DIN 18032 Part 2</p>
<p>The newest - and the best - standard is the European Standard CEN 217. This standard has brought together all the best performance criteria from a number of current standards in the EU including BS and DIN.</p>
<p>All Junckers systems fully comply with one or more of these standards. They ALL comply with the minimum Sport England requirement of BS7044 Part 4 compliance.</p></description>

You talk about using hyphens, but the character you're trying to insert is the mathematical minus sign. Have you tried it with an actual hyphen? And not an HTML entity, either; just the character, -.
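For what it's worth, here is a minimal sketch of that fix in Python (the question doesn't say what language generates the feed, so the language and field values here are only illustrative): normalize any minus sign (U+2212) to a plain ASCII hyphen before the text goes into the feed, and write the XML with the matching UTF-8 declaration.

import xml.etree.ElementTree as ET

MINUS_SIGN = "\u2212"  # the mathematical minus sign the entity produces

def clean(text):
    # Replace minus signs with ordinary ASCII hyphens (U+002D).
    return text.replace(MINUS_SIGN, "-")

item = ET.Element("item")
title = ET.SubElement(item, "title")
title.text = clean("Sports floor − solid hardwood")  # illustrative value

# Writing with an explicit UTF-8 declaration matches the feed's header.
ET.ElementTree(item).write("feed.xml", encoding="utf-8", xml_declaration=True)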

Related

Optimize Google Places API Query for Prominent Parks, Mountains, Conservation Areas

First post on Stack Overflow.
I am using the Google API to sort images taken while traveling into organized folders, append tags, and rename files with relevant information. I have my code working well but am not always happy with the results. I want to be able to focus my query results on major tourist attractions such as National Parks, Ski Resorts, Beaches, etc. The problem I am finding is that the prominence "rankby" variable and the "radius" are not giving satisfactory results. Here is a typical query for Zion National Park:
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.269486111111,-112.948141666667&rankby=prominence&radius=50000&type=natural_feature,tourist_attraction,point_of_interest&keyword=&key=MYAPIKEY
The most prominent result is Springdale, which is the town where you enter the park. Zion National Park is listed much further down in the results. What my code does is use the LAT and LON extracted from the EXIF data to make a Google API Nearby Search request and find the Place ID for where the photo was taken. It then makes another API request for Place Details, using the place_id provided by the previous step, to cut down on the information I need to parse:
https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJ8R5RCzaNyoARegi3rqVkstk&fields=name,address_component&key=MYAPIKEY
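In Python terms, the flow is roughly the following (a simplified sketch, not my full script; MYAPIKEY and the coordinates are the same placeholders as in the URLs above):

import requests

API_KEY = "MYAPIKEY"  # placeholder

def nearby_place_id(lat, lon):
    # Step 1: Nearby Search ranked by prominence around the photo's location.
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/place/nearbysearch/json",
        params={
            "location": f"{lat},{lon}",
            "rankby": "prominence",
            "radius": 50000,
            "type": "tourist_attraction",
            "key": API_KEY,
        },
    )
    results = resp.json().get("results", [])
    return results[0]["place_id"] if results else None

def place_details(place_id):
    # Step 2: Place Details limited to the fields I need to parse.
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/place/details/json",
        params={"place_id": place_id, "fields": "name,address_component", "key": API_KEY},
    )
    return resp.json().get("result", {})

pid = nearby_place_id(37.269486111111, -112.948141666667)
if pid:
    print(place_details(pid).get("name"))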
I can force the nearby search to return a National Park by searching against "National Park" in the keywords variable but that limits my project to only being able to provide National Park results since the keywords field can only accept one string.
I would like part of my query to be able to return the most prominent tourist attraction at the general level, i.e. Zion National Park, Yosemite National Park, etc., so I can sort images into the general name folders, and another part of the query to provide the exact location, i.e. I am on this trail or at this lookout. The problem is the Google API sees these specific locations ("Trail", "Lookout") as tourist attractions, parks, establishments, etc. as well, so it chooses those first.
What I need help with is trying to figure out if there is a better way to structure my query to return the high-level name of the major park. From my understanding, the types field only searches on the first type even if there are more in the list, and the keywords field can only accept one string as well, making it impossible for one phrase to capture all major destinations at a high level.
Perhaps it needs to be done with more queries but I am trying to limit the number of queries to stay inside the free quota. Maybe it will just take a long time to fully sort my files.
I have read through and implemented the Google API structure. I'm hoping someone can provide a more detailed query structure or a method to parse out truly prominent locations rather than Google's interpretation of prominence, as it can be affected by user ratings, etc., and is not always accurate.

What is the difference between a concept and a label in XBRL, and do all listed companies share the same US GAAP labels?

Let me show Tesla's company facts data with the SEC's RESTful API:
https://data.sec.gov/api/xbrl/companyfacts/CIK0001318605.json
You can see all the labels under 'facts' -> 'us-gaap', such as:
AccountsAndNotesReceivableNet
AccountsPayableCurrent
AccountsReceivableNetCurrent
AccretionAmortizationOfDiscountsAndPremiumsInvestments
Do all listed companies share the same us-gaap label names?
Can every company create its own customized us-gaap label names?
In the official definition, a concept in XBRL is "A taxonomy element that provides the meaning for a fact":
https://www.xbrl.org/guidance/xbrl-glossary/
What is the difference between a concept in XBRL and a us-gaap label?
The short answer is yes.
First, a small detail:
AccountsAndNotesReceivableNet
AccountsPayableCurrent
AccountsReceivableNetCurrent
AccretionAmortizationOfDiscountsAndPremiumsInvestments
These are not labels; they are the local names of concepts. Labels are something different and human-readable; for example, "Accounts and notes receivable, net" would be a label. Labels are attached with the label linkbase.
The more complete names (called QNames) of these concepts are:
us-gaap:AccountsAndNotesReceivableNet
us-gaap:AccountsPayableCurrent
us-gaap:AccountsReceivableNetCurrent
us-gaap:AccretionAmortizationOfDiscountsAndPremiumsInvestments
where the us-gaap prefix is bound with the US GAAP namespace, which changes every year and is, for 2021:
http://fasb.org/us-gaap-std/2021-01-31
This makes explicit that these concepts are not maintained by companies, but by the Financial Accounting Standards Board. Thus, all companies filing their reports into the EDGAR system share these concepts.
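To see this from the companyfacts JSON itself, here is a small sketch (assuming the file keeps a human-readable "label" field next to each us-gaap entry, as the Tesla file linked above does):

import requests

url = "https://data.sec.gov/api/xbrl/companyfacts/CIK0001318605.json"
# The SEC expects a descriptive User-Agent on automated requests.
facts = requests.get(url, headers={"User-Agent": "you@example.com"}).json()

for concept, entry in facts["facts"]["us-gaap"].items():
    # 'concept' is the local name, e.g. AccountsPayableCurrent;
    # the full QName would be us-gaap:AccountsPayableCurrent.
    print(concept, "->", entry.get("label"))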
Two important points:
Companies are allowed to create their own concepts. These are called extension concepts. You will recognize them because they are in a company namespace, not in the US GAAP namespace. Their prefix will not be us-gaap, but some company-specific prefix. These concepts are unique to each company.
An example for Tesla is:
tsla:AccruedAndOtherCurrentLiabilities
Concepts in the US GAAP taxonomy are updated every year, i.e., some get added, some get deprecated, some are removed. However, the FASB tries to maintain consistency across years, i.e., a concept will not suddenly change its semantics from one year to the next.

what address information should I collect when developing an international signup for a website

I have used Google to obtain an address from a postcode and the like before. My problem is that I want my website to have address fields such that anyone in any country can sign up properly and provide all necessary address information. I will include a feature to enter a postcode and obtain all other information automatically.
Is it reasonable for me to check the postcode and force a successful Google lookup before someone signs up? If so, I could just store the JSON string in the database as a blob, or maybe inside a class. But I still need to decide what fields, such as street name, postcode or ZIP, and the like, to include. I'm not sure where to begin deciding what to include.
I think what I'm really asking, is what fields are associated with what google fields in general. I know the different administrative levels are different things in different countries :/
While I can't say what you should specifically do for Google, I can tell you what fields our customers use when they validate international addresses online. (Full Disclosure: I'm a programmer at SmartyStreets where we validate international addresses.)
While each country's mailing system is unique, there are a few major similarities that they all share. This element of commonality is what allows you to have people enter their address into a universal form and then validate the address, regardless of the country in question.
Address Line 1: This field is usually the house or building number and the street on which the building is located. Examples of this field include: 123 Main Street, Calle Proc. San Sebastián, 15, 1019 North 1300 West, etc.
Address Line 2: This field would include apartment or suite numbers.
Locality: The most common data entered for this is the city component of the address. For example: Paris, Hamburg, Johannesburg, etc.
Administrative Area: This is the state or province name or abbreviation. Examples of this would be Texas - TX, Alberta - AB, Firenze (Italy) - FI.
Postal Code (where available): Examples of this would be 90210 (Beverly Hills in California) or 84000 (Avignon in France).
While you can always add additional fields to give additional context to a software parser or interpreter, the above fields are the most common ones that you would use for international address validation. If you're not sure, you can test a non-US address for free. We offer extensive documentation that is both free and publicly visible to help better explain the nuances and idiosyncrasies of street and mailing addresses.
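As a starting point, those fields might translate into a data model roughly like this (a sketch only; the field names, plus the extra country field, are my own choices and not tied to any particular API):

from dataclasses import dataclass
from typing import Optional

@dataclass
class InternationalAddress:
    address_line_1: str            # house/building number and street
    address_line_2: Optional[str]  # apartment or suite number
    locality: str                  # city, e.g. Paris, Hamburg
    administrative_area: str       # state/province, e.g. TX, AB, FI
    postal_code: Optional[str]     # not every country uses one
    country: str                   # e.g. an ISO 3166-1 alpha-2 code

addr = InternationalAddress(
    "123 Main Street", None, "Beverly Hills", "CA", "90210", "US"
)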

Kofax: Separate Main Invoice from Supporting Document without using Separator sheet

When a batch gets created, documents should be separated automatically without using a separator sheet or barcode separator.
How can I classify documents as invoice or supporting document?
In our project we get many invoices with supporting documents, so the scanning operator has to insert the separator sheets manually; to avoid this we want to classify the supporting documents automatically.
In general the concept would be that you would enable separation in the project and then train your classes with examples to be used for the layout or content classifiers.
However, as I'm sure you've seen, the obstacle with invoices is that they are different enough between vendors that it would not reliably classify all to an Invoice class. Similarly with "Supporting Documents" which are likely to be very different from each other, so unfortunately there isn't a completely easy answer without separator sheets (or barcode stickers affixed to supporting docs).
What you might want to do is write code in one of the separation events, like the Document_AfterSeparate event. Despite the name, the document has not yet been split at this point, but the classifiers have run. See the Scripting Help topic "Server Script Events Sequence > Document Separation > Standard Document Separation" for more detail. Setting the SplitPage property on the CDocPage (pXDoc.CDoc.Pages.ItemByIndex(lPage).SplitPage) will allow you to use your own logic to determine which pages to separate.
For example if you know that you will always have single page invoices, you can split on the first page and classify accordingly. Or you can try to search for something that indicates the end of the invoice like "Total" or other characteristics. There is an example of how you can use locators to help separation in the Scripting Help topic "Script Samples > Use Locator Results for Standard Document Separation". The example uses a Barcode Locator, but the same concept works if you wanted to try it with a Format Locator or anything else.
Without separator sheets you will need smart classification software like Kofax Transformation Module (KTM). It's kind of expensive; you will need to verify the cost savings and ROI.

Programmatically find common European street names

I am in the middle of designing a web form for German and French users. Within this form, the users would have to type street names several times.
I want to minimize the annoyance to the user, and offer autocomplete feature based on common French and German street names.
Any idea where I can get a royalty-free list?
Would your users have to type the same street name multiple times? Because you could easily prevent this by coding something that prefilled the fields.
Another option could be to use your user database as a resource. Query it for all the available street names entered by your existing users and use that to generate suggestions.
Of course this would only work if you have a considerable number of users.
[EDIT] You could have a look at OpenStreetMap with their Planet.osm dumps (or have a look here for a dump containing data for just Europe). That is basically the OSM database with all the map information they have, including street names. It's all in an XML format and streets seem to be stored as Ways. There are tools (e.g. Osmosis) to extract the data and put it into a database, or you could write something to plough through the data and filter out the street names for your database.
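For example, a rough sketch of ploughing through an .osm XML extract for street names (the file name is a placeholder for whatever regional extract you download; ways carrying both a "highway" and a "name" tag are treated as streets):

import xml.etree.ElementTree as ET

street_names = set()
for _, elem in ET.iterparse("europe-extract.osm", events=("end",)):
    if elem.tag == "way":
        tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
        if "highway" in tags and "name" in tags:
            street_names.add(tags["name"])
        elem.clear()  # keep memory bounded on large dumps

print(len(street_names), "unique street names")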
Start with http://en.wikipedia.org/wiki/Category:Streets_in_Germany and http://en.wikipedia.org/wiki/Category:Streets_in_France. You may want to verify the Wikipedia copyright isn't more protective than would be suitable for your needs.
Edit (merged from my own comment): Of course, to answer the "programmatically" part of your question: figure out how to spider and scrape those Wikipedia category pages. The polite thing to do would be to cache it, rather than hitting it every time you need to get the street list; refreshing once every month or so should be sufficient, since the information is unlikely to change significantly.
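Rather than scraping the category HTML, the MediaWiki API exposes category members directly; a rough sketch (continuation for categories larger than 500 entries is left out):

import requests

def category_members(category):
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": 500,
        "format": "json",
    }
    resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
    return [m["title"] for m in resp.json()["query"]["categorymembers"]]

print(category_members("Category:Streets_in_Germany")[:10])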
You could start by pulling names via the Google API (just find e.g. the lat/long outer bounds of Paris and go to the centre), but since Google limits API use, it would probably take a very long time to do it.
I had once contacted City of Bratislava about the street names list and they sent it to me as XLS. Maybe you could try doing that for your preferred cities.
I like Tom van Enckevort's suggestion, but I would be a little more specific than just looking inside the Planet.osm links, because most of them require the use of some tool to deal with the supported formats (PBF, OSM XML, etc.).
In fact, take a look at the following link
http://download.gisgraphy.com/openstreetmap/
The files there are all in .txt format and if it's only the street names that you want to use, just extract the second field (name) and you are done.
As an FYI, I didn't have any use for the French files in my project, but mining the German files resulted (after normalization) in a little more than 380K unique entries (~6 MB in size).
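Assuming the .txt files are tab-delimited with the name in the second field, as described above, collecting the unique names could look like this (the file name is a placeholder; adjust the delimiter if the files differ):

import csv

names = set()
with open("DE.txt", encoding="utf-8", newline="") as fh:
    for row in csv.reader(fh, delimiter="\t"):
        if len(row) > 1 and row[1]:
            names.add(row[1])

print(len(names), "unique street names")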
#dusoft might be onto something - maybe someone at a government level can help? I don't think that a simple list of street names can be copyrighted, nor any royalties charged. If that is the case, maybe you could even scrape some mapping data from something like a TomTom?
The "Deutsche Post" offers a list with all street names in Germany:
http://www.deutschepost.de/dpag?xmlFile=link1015590_3877
They don't mention the price, but I reckon it's not for free.
