This is an exceptionally general question that likely has a yes/no answer.
Let's say we have a line of shoes in our retail store.
Size 5 and Size 6 both have different assigned barcodes, as I've learned is the standard.
Great, we can now track them as different products.
Barcodes have a manufacturer identifier on the left, and a product identifier on the right.
My question: if we look at the barcodes for the Size 5 and Size 6 of our shoes, can we ever know that they are both from the same line of product? Just from the barcode?
As far as I can see, there is no such information within barcodes. The two products are simply variants, yet their barcodes make them appear completely different. One could be a shoe, and one could be a pack of birthday balloons.
Or, can we tell, from a barcode, that two products are actually variants (in this case, sizes) of the same product?
We could, of course, do a barcode lookup with an API, but in none of the JSON data I've looked at does there seem to be any way to associate them with each other. MPN (manufacturer part number) values don't seem to help either.
Titles can be similar, but they are rarely exactly the same.
Welcome to SO. I work with barcodes of different kinds.
The question you have does in principle have a yes/no answer, if you think of regular "retail" barcodes such as EAN-13 or UPC. As you say, these have 4 parts: [country][company][item][check digit]. In this scheme, each product code is unique by itself and unrelated to other products.
But not all barcodes are like this. Other barcodes may carry data identifiers in a structured way. The most common scheme is the GS1 Application Identifier (AI) system. Different barcode types ("symbologies") can carry this structure; common carriers are GS1-128 (formerly EAN-128), GS1 DataMatrix and GS1 QR Code, but others may be used as well. Using this logic, you can create a relation between your products, using things like a LOT code or internal company numbers.
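As a rough illustration of that structure, here is a minimal sketch of parsing a GS1 element string. It is simplified: it assumes two-digit AIs only (real AIs can be 2-4 digits), handles just two fixed-length AIs (01 = GTIN, 17 = expiry) plus variable-length ones like 10 (lot), and the GTIN/lot values are made up.

```python
# Sketch: split a scanned GS1 element string into its application
# identifiers. Variable-length AIs are FNC1-terminated in the symbol;
# many scanners deliver that as an ASCII GS character (\x1d).
FIXED_LENGTH = {"01": 14, "17": 6}  # AI -> payload length

def parse_gs1(data: str) -> dict:
    fields = {}
    i = 0
    while i < len(data):
        ai = data[i:i + 2]          # simplifying assumption: 2-digit AIs
        i += 2
        if ai in FIXED_LENGTH:
            n = FIXED_LENGTH[ai]
            fields[ai] = data[i:i + n]
            i += n
        else:
            # variable-length AI: read until the GS separator (or end)
            end = data.find("\x1d", i)
            end = len(data) if end == -1 else end
            fields[ai] = data[i:end]
            i = end + 1
    return fields

# GTIN 09521234543213, expiry 26-05-31, lot ABC123 (invented values)
print(parse_gs1("01095212345432131726053110ABC123"))
```

Two size variants sharing the same LOT code (or a company-internal AI) could then be related through those parsed fields, even though their GTINs differ.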
Finally, nothing prevents you from creating your own barcoding strategy using a separate code on the product/box. It could be a Code 128 with a structure you build yourself, such as product family, type, and so on.
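A home-grown structured code can be as simple as a delimited string encoded in Code 128. The field layout below ("family-style-size") and the names are invented for illustration; the point is that the shared family/style fields relate the size variants.

```python
# Sketch: a self-defined structured code for a Code 128 label.
def make_code(family: str, style: str, size: int) -> str:
    return f"{family}-{style}-{size:02d}"

def parse_code(code: str) -> dict:
    family, style, size = code.split("-")
    return {"family": family, "style": style, "size": int(size)}

a = parse_code(make_code("SHOE", "ATLAS", 5))
b = parse_code(make_code("SHOE", "ATLAS", 6))
print(a["style"] == b["style"])  # same product line, different sizes -> True
```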
I am looking into options for auto-indexing the daily documents generated by tellers in bank operations. The documents do not have any reference number and are handwritten by customers.
So, to auto-index these documents and store them in the EDMS, we have to put the core banking transaction reference number on each one. What options do I have? Print a barcode label containing this transaction number and attach it to the paper? Or have a machine that I can feed the paper into and that prints the barcode on it?
Does anyone know the right hardware or software for this?
Thanks
It depends on how complex you want to be. These documents could perhaps be multiple (stapled?) pages. Would you want to index each page, and would the documents then form an associated sequence (e.g. doc. 00001-01 to -20)?
Next caper is to consider the form of the number. It's best to formulate a check-digit system so that a printed number can be manually entered and the check digit verifies that the number hasn't been miskeyed.
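As a sketch of such a scheme, here is the widely used Luhn mod-10 algorithm; it is one choice among many, and any weighted mod-10 scheme would serve the same purpose.

```python
# Sketch: Luhn mod-10 check digit. Append the digit when printing the
# reference number; on manual entry, recompute and compare to catch
# miskeyed or transposed digits.
def luhn_check_digit(number: str) -> str:
    total = 0
    # double every second digit, starting from the rightmost payload digit
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def is_valid(number_with_check: str) -> bool:
    return luhn_check_digit(number_with_check[:-1]) == number_with_check[-1]

print(luhn_check_digit("7992739871"))  # "3", the classic worked example
print(is_valid("79927398713"))         # True
```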
Now - if these documents may be different sizes for instance, or potentially a wad of paperwork, how would you feed them through a printer?
So I'd suggest that a good choice would be to produce your numbers on a specialist barcode-printer with human-readable line on the same label. Some idiot will want to insist on using cheap thermally-sensitive labels, but these almost inevitably deteriorate with time. I'd choose thermal-transfer labels which are a little more complex - your tellers would need to be able to load label-rolls and also the transfer-ribbon (a little like a typewriter-ribbon, if you remember those) but basically any monkey could do it.
Even then, there are three grades of ribbon: wax, resin, and a combination. The problem with wax is that it can become worn - the same thing you get with laser printing, where pages stick together if left to their own devices for a while. Another reason you don't use laser printers in this role - apart from the fact that you'd need to produce whole sheets of labels rather than ones and twos on demand - is that the laser processing will cook the glue on the sheets. Fine for an address label with a lifetime of a few days, but disastrous when you may be storing documents for years. Document goes one way, label another...
Resin is the best but most expensive choice. It has better wearing characteristics.
My choice would be a Zebra TLP2824 Plus using thermal-transfer labels and resin ribbon. Software is easy - it just means you need to go back 20 years in time and forget all about drivers - just send a string to the printer as if it were a generic text printer. The formatting of the label - well, the manual will show you that...
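To give a feel for what "just send a string" looks like, here is a sketch that builds a minimal label in ZPL (the 2824 Plus speaks both EPL2 and ZPL; this assumes ZPL mode). The transaction number, printer address, and label geometry are hypothetical; the printer manual covers positioning.

```python
# Sketch: a minimal ZPL label with a Code 128 barcode and the
# human-readable interpretation line under it.
def make_label(trans_number: str) -> str:
    return (
        "^XA\n"                    # start of label format
        "^FO30,30^BY2\n"           # field origin, narrow-bar width
        "^BCN,80,Y,N,N\n"          # Code 128, 80 dots tall, with text line
        f"^FD{trans_number}^FS\n"  # field data
        "^XZ\n"                    # end of label format
    )

label = make_label("TXN0001234")
print(label)
# To print, send the string as raw bytes, e.g. over a network socket:
#   import socket
#   with socket.create_connection(("192.168.0.50", 9100)) as s:  # hypothetical address
#       s.sendall(label.encode("ascii"))
```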
Other technologies and approaches would probably be more complex than simply producing and attaching barcode labels. For instance, if you were to have an inkjet printer like those that are used to mark (milk/juice) cartons - well, it would have to deal with different sizes of paper, and different weights from near-cardboard to airmail paper. It would also have a substantial footprint since the paper would need to be physically presented to the printer. Then there's all the problems of disassembling and reassembling a stapled wad. And who can control precisely where the printing would occur? What may suit one document may not suit another - it may have inconveniently-placed logos or other artwork in the "standard" position for that-sized paper.
Another issue is colour. There's no restriction on background colour with a label (yellow or fluoro pink for example) - it would be easy to locate when necessary. Contrast that with the-ink's-running-low washed-out ink printing on a grey background. White labels wouldn't stand out all that well on the majority of (white) documents.
BUT a strong alternative technology would be to have reels of labels pre-printed by a commercial printing establishment rather than producing them with a special printer on-demand. Reels are better than sheets - they are easier to use especially for people with short fingernails.
I have a large set of data (several hundred thousand records) that are unique entries in a CSV. These entries are essentially products being listed in a store from a vendor that offers them. The problem is that while the vendor gives us the right to copy these verbatim or to change the wording, I obviously don't want to list them verbatim, since Google will penalize the ranking for "duplicate" content. And then, also obviously, manually editing 500,000 items would take a ridiculous amount of time.
The solution, it would seem, would be to leverage fuzzy logic that would take certain phraseology and transform it into something different that would not then be penalized by Google. I have hitherto been unable to find any real library or solid solution that addresses such a situation.
I am thinking through my own algorithms to perhaps accomplish this, but I hate to reinvent the wheel or, worse, be beaten down by the big G after a failed attempt.
My idea is to simply search for various phrases and words (sans stop words) and then essentially map those to phrases and words that can be randomly inserted that still have equivalent meaning, but enough substance to hopefully not cause a deranking situation.
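The mapping idea described above could be sketched roughly as below. The phrase map is invented for illustration; a real one would need careful curation so substitutions actually preserve meaning, and this naive string replacement ignores word boundaries and inflection.

```python
import random

# Sketch: naive phrase-to-synonym substitution over a product description.
PHRASE_MAP = {
    "high quality": ["premium", "top-grade"],
    "easy to use": ["user-friendly", "simple to operate"],
    "durable": ["long-lasting", "hard-wearing"],
}

def rewrite(description: str, rng: random.Random) -> str:
    out = description
    for phrase, alternatives in PHRASE_MAP.items():
        if phrase in out:
            out = out.replace(phrase, rng.choice(alternatives))
    return out

rng = random.Random(42)  # seed so bulk runs are reproducible
print(rewrite("A high quality, durable boot that is easy to use.", rng))
```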
A solution for Ruby would be optimal, but absolutely not necessary as any language can be used.
Are there any existing algorithms, theories or implementations of a similar scenario that could be used to model or solve such a scenario?
Is there an open data format for representing such GIS data as roads, localities, sublocalities, countries, buildings, etc.?
I expect that such a format would define an address structure and names for the components of an address.
What I need is a data format to return in response to reverse geocoding requests.
I looked for it on the Internet, but it seems that every geocoding provider defines its own format.
Should I design my own format?
Does my question make any sense at all? (I'm a newbie to GIS).
In case I have not made myself clear: I am not looking for data formats such as GeoJSON, GML or WKT, since they define geometry and don't define any address structure.
UPD. I'm experimenting with different geocoding services and trying to isolate them into a separate module. I need to provide one common interface for all of them, and I don't want to make up one more data format (because on the one hand I don't fully understand the domain, and on the other the field itself seems to be well studied). The module's responsibility is to take a partial address (or coordinates) like "96, Dubininskaya, Moscow" and return a data structure containing house number (96), street name (Dubininskaya), sublocality (Danilovsky rn), city (Moscow), administrative area (Moskovskaya oblast), and country (Russia). The problem is that different countries may have more or fewer levels of division (more or fewer address components), and I need to unify these components across countries.
Nope, there is not, unfortunately.
Why, you may ask?
Because different nations and countries have vastly different formats and requirements for storing addresses.
Here in the UK, for example, defining a postcode involves quite a complex set of rules, whereas ZIP codes in the US are simply five digits (with an optional four-digit ZIP+4 extension).
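The contrast in complexity shows up even in validation. The US ZIP pattern below is trivial; the UK pattern is a simplified version of the commonly used style of postcode regex and does not capture every edge case (special cases like GIR 0AA are ignored).

```python
import re

# Sketch: ZIP vs. UK postcode format validation.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")
UK_POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")

print(bool(ZIP_RE.match("90210")))             # True
print(bool(UK_POSTCODE_RE.match("SW1A 1AA")))  # True
print(bool(UK_POSTCODE_RE.match("90210")))     # False
```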
Then you have to consider the question of what exactly constitutes an address. Again, this differs not just from country to country, but sometimes drastically within the same territory.
For example (here in the UK):
Smith and Sons Butchers
10 High street
Some town
Mr smith
10 High street
Some town
The Occupier
10 High Street
Some Town
Smith and Sons Butchers
High Street
Some Town
Are all valid addresses in the UK, and in all cases the post would arrive at the correct destination, a GPS however may have trouble.
A GPS database might be set up so that each building is a square bit of geometry, with the ID being the house number.
That, would give us the ability to say exactly where number 10 is, which means immediately the last look up is going to fail.
Plots may be indexed by the name of the business; again that's fine until you start using personal names, or generic titles.
There's so much variation, that it's simply not possible to create one unified format that can encompass every possible rule required to allow any application on the planet to format any geo-coded address correctly.
So how do we solve the problem?
Simple, by narrowing your scope.
Deal ONLY with a specific set of defined entities that you need to work with.
Hold only the information you need to describe what you need to describe (Always remember YAGNI* here)
Use standard data transmission formats such as JSON, XML and CSV; this reduces the work needed for code you don't control to read your data output.
(* YAGNI = You ain't gonna need it)
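Putting the "narrow your scope" advice into code, the asker's unified interface might look like the sketch below. Every component is optional, since different countries have more or fewer levels of division; the field names here are invented for illustration, not taken from any standard.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch: a narrow, application-specific address structure for
# reverse-geocoding results across providers.
@dataclass
class Address:
    house_number: Optional[str] = None
    street: Optional[str] = None
    sublocality: Optional[str] = None
    city: Optional[str] = None
    admin_area: Optional[str] = None
    country: Optional[str] = None

addr = Address(house_number="96", street="Dubininskaya",
               city="Moscow", country="Russia")
print(addr.city)  # components absent in a given country simply stay None
```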
Now, to dig in deeper however:
When it comes to actual GIS data, there are a lot of standard file formats; the 3 most common are:
Esri Shape Files (*.shp)
Keyhole Markup Language (*.kml)
Comma separated values (*.csv)
All of the mainstay GIS packages, free and paid-for, can work with any of these 3 file types, and many more.
Shapefiles are by far the most common ones you're going to come across; just about every bit of geospatial data I've come across in my years in IT has been in a shapefile. I would, however, NOT recommend storing your data in them for processing; they are quite a complex format, and often slow and sequential to access.
If you need your geometry files to be consumed by other systems, however, you can't go wrong with them.
They also have the added bonus that you can attach attributes to each item of data too, such as address details, names etc.
The problem is, there is no standard for what you would call the attribute columns or what you would include, and, probably more drastically, the column names are conventionally UPPERCASE and limited to 10 characters in length (the dBASE field-name limit).
KML files are another format that's quite universally recognized, and because they're XML-based and used by Google, you can include a lot of extra data in them that is technically self-describing to the machine reading it.
Unfortunately, file sizes can be incredibly bulky even for a handful of simple geometries; the trade-off is that they are pretty easy to handle in just about any programming language on the planet.
And that brings us to the humble CSV.
The mainstay of data transfer (not just geospatial) ever since time began.
If you can put your data in a database table or a spreadsheet, then you can put it in a CSV file.
Again, there are no standards, other than how columns may or may not be quoted and what the separator is, and readers have to know ahead of time what each column represents.
Also, there's no "pre-made" geographic storage element (in fact there are no data types at all), so your reading application will also need to know ahead of time what each column's data type is meant to be so it can parse them appropriately.
On the plus side however, EVERYTHING can read them, whether they can make sense of them is a different story.
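That "know ahead of time" point is worth seeing concretely. In the sketch below, the column names, their order, and the fact that lat/lon are decimal degrees are all conventions the producer and consumer must agree on out-of-band; the sample data is invented.

```python
import csv
import io

# Sketch: reading point data from a CSV with explicitly parsed types.
raw = """name,lat,lon
Big Ben,51.5007,-0.1246
Red Square,55.7539,37.6208
"""

rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    rows.append({"name": rec["name"],
                 "lat": float(rec["lat"]),   # nothing in the file says
                 "lon": float(rec["lon"])})  # these are coordinates

print(rows[0]["lat"])  # 51.5007
```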
I've been working on a project to data-mine a large amount of short texts and categorize these based on a pre-existing large list of category names. To do this I had to figure out how to first create a good text corpus from the data in order to have reference documents for the categorization and then to get the quality of the categorization up to an acceptable level. This part I am finished with (luckily categorizing text is something that a lot of people have done a lot of research into).
Now my next problem, I'm trying to figure out a good way of linking the various categories to each other computationally. That is to say, to figure out how to recognize that "cars" and "chevrolet" are related in some way. So far I've tried utilizing the N-Gram categorization methods described by, among others, Cavnar and Trenkle for comparing the various reference documents I've created for each category. Unfortunately it seems the best I've been able to get out of that method is approximately 50-55% correct relations between categories, and those are the best relations, overall it's around 30-35% which is miserably low.
I've tried a couple of other approaches as well but I've been unable to get much higher than 40% relevant links (an example of a non-relevant relation would be the category "trucks" being strongly related to the category "makeup" or the category "diapers" while weakly (or not at all) related to "chevy").
Now, I've tried looking for better methods for doing this but it just seems like I can't find any (yet I know others have done better than I have). Does anyone have any experience with this? Any tips on usable methods for creating relations between categories? Right now the methods I've tried either don't give enough relations at all or contain way too high a percentage of junk relations.
Obviously, the best way of doing that matching is highly dependent on your taxonomy, the nature of your "reference documents", and the expected relationships you'd like created.
However, based on the information provided, I'd suggest the following:
Start by building a word-based (rather than letter-based) unigram or bigram model for each of your categories, based on the reference documents. If there are only a few of these for each category (it seems you might have only one), you could use a semi-supervised approach and also throw in the automatically categorized documents for each category. A relatively simple tool for building the model might be the CMU SLM toolkit.
Calculate the mutual information (information gain) of each term or phrase in your model with relation to the other categories. If your categories are similar, you might need to use only neighboring categories to get meaningful results. This step gives the best separating terms higher scores.
Correlate the categories to each other based on the top information-gain terms or phrases. This could be done either by using Euclidean or cosine distance between the category models, or by using somewhat more elaborate techniques, like graph-based algorithms or hierarchical clustering.
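The model-building and distance steps above can be sketched as follows (skipping the mutual-information filtering for brevity). The reference documents are invented one-liners; real models would be built from the full reference corpus.

```python
import math
from collections import Counter

# Sketch: word-unigram model per category, then cosine similarity
# between category models.
docs = {
    "cars":      "engine wheel chevrolet sedan drive road engine",
    "chevrolet": "chevrolet engine truck dealer drive warranty",
    "makeup":    "lipstick mascara foundation beauty skin",
}

models = {cat: Counter(text.split()) for cat, text in docs.items()}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

print(cosine(models["cars"], models["chevrolet"]) >
      cosine(models["cars"], models["makeup"]))  # True: cars ~ chevrolet
```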
I am tasked with trying to create an automated system that removes personal information from text documents.
Emails and phone numbers are relatively easy to remove. Names are not. The problem is hard because there are names in the documents that need to be kept (e.g. references, celebrities, characters, etc.). The author's name needs to be removed from the content (there may also be more than one author).
I have currently thought of the following:
Quite often personal names are located at the beginning of a document
Look at how frequently the name is used in the document (personal names tend to be written just once)
Search for words around the name to find patterns (mentions of university and so on...)
Any ideas? Has anyone solved this problem already?
With current technology, doing what you are describing in a fully automated way with a low error rate is impossible.
It might be possible to come up with an approximate solution, but it would still make a lot of errors: either false positives or false negatives, or some combination of the two.
If you are still really determined to try, I think your best approach would be Bayesian filtering (as used in spam filtering). The reason for this is that it is quite good at assigning probabilities based on relative positions and frequencies of words, and could also learn which names are more or less likely to be celebrities, etc.
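The spam-filter analogy boils down to scoring tokens against labeled history. The sketch below shows the core idea with a tiny invented "training" set; a real filter would combine many features (position in document, surrounding words) rather than a single count ratio.

```python
from collections import Counter

# Sketch: score each candidate token by how often it appeared as an
# author name vs. a name-to-keep in previously labeled documents.
author_names = Counter(["smith", "jones", "smith", "patel"])
keep_names   = Counter(["einstein", "shakespeare", "jones"])

def p_author(token: str) -> float:
    a = author_names[token]
    k = keep_names[token]
    if a + k == 0:
        return 0.5  # unseen token: no evidence either way
    return a / (a + k)

print(p_author("smith"))     # 1.0 -> flag for removal
print(p_author("einstein"))  # 0.0 -> keep
print(p_author("jones"))     # 0.5 -> ambiguous, needs review
```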
The area of machine learning you would need to learn about to attempt this is natural language processing. There are a few different approaches that could be used: Bayesian networks (something better than a naive Bayes classifier), support vector machines, or neural nets would be areas to research. Whatever system you end up building would probably need to use an annotated corpus (a labeled set of data) to learn where names appear. Even with a large corpus, whatever you build will not be 100% accurate, so you would probably be better off flagging names for deletion instead of just deleting all of the words that might be names.
This is a common problem in basic cryptography courses (my first programming job).
If you generated a word histogram of your entire document corpus (each bin on the x-axis is a word, with its frequency as the height on the y-axis), words like "this", "the", "and" and so forth would be easy to identify because of their large y-values (frequency). Surnames should be at the far right of your histogram - very infrequent; given names towards the left, but not by much.
Does this technique definitively identify the names in each document? No, but it could substantially constrain your search by eliminating all words whose frequency is larger than X. Likewise, there should be other attributes that constrain the search, such as author names appearing only once in the documents they authored and not in any other documents.
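The histogram filter above can be sketched in a few lines. The corpus and the threshold X are invented; real documents would need proper tokenization, case folding, and punctuation handling.

```python
from collections import Counter

# Sketch: corpus-wide word frequencies used to shortlist name candidates.
corpus = [
    "the report was filed by smith on monday",
    "the budget report and the audit were late",
    "jones reviewed the report with the committee",
]

freq = Counter(word for doc in corpus for word in doc.split())
X = 2  # keep only words rarer than this as candidates

candidates = {w for w, n in freq.items() if n < X}
print("smith" in candidates, "the" in candidates)  # True False
```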