Leftover / Unused Fields from Solr Default 'example' - performance

I started configuring Solr based off of the original example and so there were many random fields that they included for the example.
I might use one or two of them but for the most part they are all just sitting there unused (nothing in my data config mentions them or stores data to them.
Does this create any impact on performance ( even a small amount ) & should I just go ahead and delete them or does it not matter?
Thanks guys

I recommend deleting the fields and associated unused types. Possibly no performance impact, but there are some dynamic fields and copyFields which may cause confusion down the line. Or just build a new config directly instead (you only need two files to start: schema.xml and solrconfig.xml).
If you are doing it by trimming unused fields, make sure to keep or change whatever your uniqueKey points at, whatever your 'df' default field is and _ version_ (no spaces with underscores).
The last one is requires for real-time get and updateLog, which are enabled in example's solrconfig.xml as well. Easier to keep _ version_ than to try removing all those things.
(Update Jan 2017: There is now a presentation video, specifically working through examples and how to clean them)

Related

Do (document) bundle entries always have to be referenced or referencing?

The specification for FHIR documents seems to mandate that all bundle entries in the document resource be part of the reference graph rooted at the Composition entry. That is, they should be the source or the target of a reference relation that traces all the way up to the root entry.
Unfortunately I have not been able to locate all the relevant passages in the FHIR specification; one place where it is spelled out is in 3.3.1 Document Content, but it is not really clear whether this pertains to all bundles of type 'document' (i.e. even those that happen to be bundles with type code 'document' but are merely collections of machine-processable data without any aspirations to represent a FHIRy document).
The problem with the referencedness requirement lies in the fact that the HAPI validator employs linear search for checking the references. So, if we have to ship N bundle entries full of data to a payor, we have to include a list with N references (one for each data-bearing bundle entry). That leads to N reference searches with O(N) effort during validation, which makes the reference checking complexity effectively quadratic in the number of entries.
This easily brings even the most powerful computers to their knees. Current size contraints effectively cap the number of entries per file at roughly 25000, and the HAPI validator needs several hours to chew through that, even on the most powerful CPUs currently available. Without the references, validation would take less than a minute for the same file.
In our use case, data-bearing entries have no identity outside of the containing bundle file. Practically speaking they would need neither entry.fullUrl nor entry.resource.id, because their business identifiers are contained in included base64 blobs. However, presence or absence of these identifiers has no practical influence on the time needed for validation (fractions of a second even for a 1 GB file), so who cares. It's the list of references that kills the HAPI validator.
Perhaps it would be possible to fulfil the letter of the referencedness requirement by making all entries include a reference to the Composition. The HAPI validator doesn't care either way, so I don't know whether that would be valid or not. But even if it were FHIRly valid, it would be a monstrously silly workaround.
Is there a way to ditch the referencedness requirement? Perhaps by changing the bundle type to something like 'collection', or by using contained resources?
P.S.: for the moment we are using a workaround that cuts the time for validation from hours to less than a minute, but it's a hack, and we currently don't have the resources to fix the HAPI validator. What I'm mostly concerned about is the question how the specifications (profiles) need to be changed in order to avoid the problem I described.
(i.e. even those that happen to be bundles with type code 'document' but are merely collections of machine-processable data without any aspirations to represent a FHIRy document)
If it is not a document, and not intended to be one, do not use the 'document' Bundle type. If you do, you would me misrepresenting the data which is what FHIR tries to avoid.
It seems like you want to send a collection of resources that are not necessarily related, so
Is there a way to ditch the referencedness requirement? Perhaps by changing the bundle type to something like 'collection'
Yes, I would use 'collection', or maybe a 'batch/transaction' depending on what I want to tell the receiver to do with the data.
The documents page says:
The document bundle SHALL include only:
The Composition resource, and any resources directly or indirectly (e.g. recursively) referenced from it
A Binary resource containing a stylesheet (as described below)
Provenance Resources that have a target of Composition or another resource included in the document
A document is a frozen set of content intended as an attested, human-readable, frozen set of content. If that's not what you need, then use a different Bundle type. However, if you do need the 'document' type, that doesn't mean that systems should necessarily validate all requirements at runtime

Representing multiple values in delimited seperated value file

Currently i'm working on transforming a xml file to delimited seperated file.I was pondering over the idea of representing multiple values of an attribute field..Currently my idea is to represent the values as below:
First Name;Last Name;E-mail id;Description
Fresher;user1;"|email1#abc.com|;|email2#abc.com|";This user joined as fresher.
My question is;Is there is a standard followed for representation of multiple values.?
How is this scenario taken care in common spreadsheet programs available such as Microsoft excel,openoffice calc and lotus notes 123 when imported into .csv file..??
Based on this i want to make changes to my xslt code..
Appreciate any help in this regard..
According to my experiences it is always good to stick to database normalisation standards. There are a lot of information everywhere in the web for further references.
a) When looking in your proposal what I like is to separate each column with semicolon instead of comma. It's easier to import data to any system later especially when you will deal with different (national) standards of number separation symbols
b) However, which I don't like is the 'e-mail' section. There would be a problem in the following areas:
quotation marks are problems- try to avoid them.
don't separate inside e-mail addresses with the same mark as for column separation. Therefore you shouldn't use semicolon there (what I guess- you can have one or few e-mails for each record).
If you can't introduce database normalisation standards I would propose the following small improvements to your idea:
Fresher;user1;email1#abc.com|email2#abc.com;This user joined as fresher
If you provide that kind of data file I think each of vba user would be able to import it to Excel (or any other system) easily and quickly.

What category / filter structure is this using? Nested Interval?

You can see on their category links that it's quite obvious that the only portion of their URL that matters is the small hash near the end of the URL itself.
For instance, Water Heaters category found under Heating/Cooling is:
http://www.lowes.com/Heating-Cooling/Water-Heaters/_/N-1z11ong/pl?Ns=p_product_avg_rating|1
and Water Heaters category found under Plumbing is:
http://www.lowes.com/Plumbing/Water-Heaters/_/N-1z11qhp/pl?Ns=p_product_avg_rating|1
That being said, obviously their structure could be a number of different things...
But the only thing I can think is it's a hex string that gets decoded into a number and denominator but I can't figure it out...
apparently it's important to them to obfuscate this for some reason?
Any ideas?
UPDATE
At first I was thinking it was some sort of base16 / hex conversion of a standard number / denom or something? or the ID of a node and it's adjacency?
Does anyone have enough experience with this to assist?
They are building on top of IBM WebSphere Commerce. Nothing fancy going on here, though. The alpha-numeric identifiers N-xxxxxxx are simple node identifiers that do not capture hierarchical structure in themselves; the structure (parent nodes and direct child nodes) is coded inside the node data itself, and there are tell-tale signs to that effect (see below.) They have no need for nested intervals (sets), their user interface does not expose more than one level at a time during normal navigation.
Take Lowe's.
If you look inside the cookies (WC_xxx) as well as see where they serve some of their contents from (.../wcsstore/B2BDirectStorefrontAssetStore/...) you know they're running on WebSphere Commerce. On their listing pages, everything leading up to /_/ is there for SEO purposes. The alpha-numeric identifier is fixed-length, base-36 (although as filters are applied additional Zxxxx groups are tacked on -- but everything that follows a capital Z simply records the filtering state.)
Let's say you then wrote a little script to inventory all 3600-or-so categories Lowe's currently has on their site. You'd get something like this:
N-1z0y28t /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Kits
N-1z0y28u /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Towers
N-1z0y28v /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Shelves
N-1z0y28w /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Hardware
N-1z0y28x /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Accessories
N-1z0y28y /Closet-Organization/Wood-Closet-Systems/Wood-Closet-Pedestal-Bases
N-1z0y28z /Cleaning-Organization/Closet-Organization/Wood-Closet-Systems
N-1z0y294 /Lighting-Ceiling-Fans/Chandeliers-Pendant-Lighting/Mix-Match-Mini-Pendant-Shades
N-1z0y295 /Lighting-Ceiling-Fans/Chandeliers-Pendant-Lighting/Mix-Match-Mini-Pendant-Light-Fixtures
N-1z0y296 /Lighting-Ceiling-Fans/Chandeliers-Pendant-Lighting/Chandeliers
...
N-1z13dp5 /Plumbing/Plumbing-Supply-Repair
N-1z13dr7 /Plumbing
N-1z13dsg /Lawn-Care-Landscaping/Drainage
N-1z13dw5 /Lawn-Care-Landscaping
N-1z13e72 /Tools
N-1z13e9g /Cleaning-Organization/Hooks-Racks
N-1z13eab /Cleaning-Organization/Shelves-Shelving/Laminate-Closet-Shelves-Organizers
N-1z13eag /Cleaning-Organization/Shelves-Shelving/Shelves
N-1z13eak /Cleaning-Organization/Shelves-Shelving/Shelving-Hardware
N-1z13eam /Cleaning-Organization/Shelves-Shelving/Wall-Mounted-Shelving
N-1z13eao /Cleaning-Organization/Shelves-Shelving
N-1z13eb3 /Cleaning-Organization/Baskets-Storage-Containers
N-1z13eb4 /Cleaning-Organization
N-1z13eb9 /Outdoor-Living-Recreation/Bird-Care
N-1z13ehd /Outdoor-Living
N-1z13ehn /Appliances/Air-Purifiers-Accessories/Air-Purifiers
N-1z13eho /Appliances/Air-Purifiers-Accessories/Air-Purifier-Filters
N-1z13ehp /Appliances/Air-Purifiers-Accessories
N-1z13ejb /Appliances/Humidifiers-Dehumidifiers/Humidifier-Filters
N-1z13ejc /Appliances/Humidifiers-Dehumidifiers/Dehumidifiers
N-1z13ejd /Appliances/Humidifiers-Dehumidifiers/Humidifiers
N-1z13eje /Appliances/Humidifiers-Dehumidifiers
N-1z13elr /Appliances
N-1z13eny /Windows-Doors
Notice how entries are for the most part sequential (it's a sequential identifier, not a hash), mostly though not always grouped together (the identifier reflects chronology not structure, it captures insertion sequence, which happened in single or multiple batches, sometimes years and thousands of identifiers apart, at the other end of the database), and notice how "parent" nodes always come after their children, sometimes after holes. These are all tell-tale signs that, as categories are added and/or removed, new versions of their corresponding parent nodes are rewritten and the old, superseded or removed versions are ultimately deleted.
If you think there's more you need to know you may want to further inquire with WebSphere Commerce experts as to what exactly Lowe's might be using specifically for its N-xxxxxxx catalogue (though I suspect that whatever it is is 90%+ custom.) FWIW I believe Home Depot (who also appear to be using WebSphere) upgraded to version 7 earlier this year.
UPDATE Joshua mentioned Endeca, and it is indeed Endeca (those N-xxxxxxx identifiers) that is being used behind Websphere in this case (though I believe since the acquisition of Endeca Oracle is pushing SUN^H^H^Htheir own Java EE "Endeca Server" platform.) So not really a 90% custom job despite the appearances (the presentation and their javascripts are heavily customized, but that's the tip of the iceberg.) You should be able to use Solr as a substitute.

Searching a list of keywords from text files in folders

I have compiled a list of db object names, one name per line, in a text file. I want to know for each names, where it is being used. The target search is a group of folders containing sub-folders of source codes.
Before I give up looking for a tool to do this and start creating my own, perhaps you can help to point to me an existing one.
Ideally, it should be a Windows desktop application. I have not used grep before.
use grep (there are tons of port of this command to windows, search the web).
eventually, use AgentRansack.
See our Source Code Search Engine. It indexes a large code base according to the atoms (tokens) of the language(s) of interest, and then uses that index to quickly execute structured queries stated in terms of language elememnts. It is a kind of super-grep, but it isn't fooled by comments or string literals, and it automatically ignores whitespace. This means you get a lot fewer false positive hits than you get with grep.
If you had an identifier "foo", the following query would find all mentions:
I=foo
For C and Java, you can constrain the types of identifier accesses to Use, Read, Write or Defines.
D=bar*
would find only declarations of identifiers which started with the letters "bar".
You can write more complex queries using sequences of language tokens:
'int' I=*baz* '['
for C, would find declarations of any variable name that contained the letters "baz" and apparantly declared an array.
You can see the hits in a GUI, and one-click navigate to a source code view of any hit.
It is a Windows application. It handles a wide variety of languages: C#, C++, Java, ... and many more.
I had created an SSIS package to load my 500+ source code files that is distributed into some depth of folders belongs to several projects, into a table, with 1 row as 1 line from the files (total is 10K+ lines).
I then made a select statement against it, by cross-applying the table that keeps the list of 5K+ keywords of db objects, with the help of RegEx for MS-SQL, http://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/. The query took almost 1.5 hr to complete.
I know it's a long winded, but this is exactly what I need. I thank you for your efforts in guiding me. I would be happy to explain the details further, should anyone gets interested using my method.
insert
dbo.DbObjectUsage
select
do.Id as DbObjectId,
fl.Id as FileLineId
from
dbo.FileLine as fl -- 10K+
cross apply
dbo.DbObject as do -- 5K+
where
dbo.RegExIsMatch('\b' + do.name + '\b', fl.Line, 0) != 0

How do you manage the String Translation Process?

I am working on a Software Project that needs to be translated into 30 languages. This means that changing any string incurs into a relatively high cost. Additionally, translation does not happen overnight, because the translation package needs to be worked by different translators, so this might take a while.
Adding new features is cumbersome somehow. We can think up all the Strings that will be needed before we actually code the UI, but sometimes still we need to add new strings because of bug fixes or because of an oversight.
So the question is, how do you manage all this process? Any tips in how to ease the impact of translation in the software project? How to rule the strings, instead of having the strings rule you?
EDIT: We are using Java and all Strings are internationalized using Resource Bundles, so the problem is not the internationalization per-se, but the management of the strings.
I'm not sure the platform you're internationalizing in. I've written an answer before on the best way to il8n an application. See What do I need to know to globalize an asp.net application?
That said - managing the translations themselves is hard. The problem is that you'll be using the same piece of text across multiple pages. Your framework may not, however, support only having that piece of text in one file (resource files in asp.net, for instance, encourage you to have one resource file per language).
The way that we found to work with things was to have a central database repository of translations. We created a small .net application to import translations from resource files into that database and to export translations from that database to resource files. There is, thus, an additional step in the build process to build the resource files.
The other issue you're going to have is passing translations to your translation vendor and back. There are a couple ways for this - see if your translation vendor is willing to accept XML files and return properly formatted XML files. This is, really, one of the best ways, since it allows you to automate your import and export of translation files. Another alternative, if your vendor allows it, is to create a website to allow them to edit the translations.
In the end, your answer for translations will be the same for any other process that requires repetition and manual work. Automate, automate, automate. Automate every single thing that you can. Copy and paste is not your friend in this scenario.
Pootle is an webapp that allows to manage translation process over the web.
There are a number of major issues that need to be considered when internationalizing an application.
Not all strings are created equally. Depending upon the language, the length of a sentence can change significantly. In some languages, it can be half as long and in others it can be triple the length. Make sure to design your GUI widgets with enough space to handle strings that are larger than your English strings.
Translators are typically not programmers. Do not expect the translators to be able to read and maintain the correct file formats for resource files. You should setup a mechanism where you can transform the translated data round trip to your resource files from something like an spreadsheet. One possibility is to use XSL filters with Open Office, so that you can save to Resource files directly in a spreadsheet application. Also, translators or translation service companies may already have their own databases, so it is good to ask about what they use and write some tools to automate.
You will need to append data to strings - don't pretend that you will never have to or you will always be able to put the string at the end. Make sure that you have a string formatter setup for replacing placeholders in strings. Furthermore, make sure to document what are typical values that will be replaced for the translators. Remember, the order of the placeholders may change in different languages.
Name your i8n string variables something that reflects their meaning. Do you really want to be looking up numbers in a resource file to find out what is the contents of a given string. Developers depend on being able to read the string output in code for efficiency a lot more than they often realize.
Don't be afraid of code-generation. In my current project, I have written a small Java program that is called by ant that parses all of the keys of the default language (master) resource file and then maps the key to a constant defined in my localization class. See below. The lines in between the //---- comments is auto-generated. I run the generator every time I add a string.
public final class l7d {
...normal junk
/**
* Reference to the localized strings resource bundle.
*/
public static final ResourceBundle l7dBundle =
ResourceBundle.getBundle(BUNDLE_PATH);
//---- start l7d fields ----\
public static final String ERROR_AuthenticationException;
public static final String ERROR_cannot_find_algorithm;
public static final String ERROR_invalid_context;
...many more
//---- end l7d fields ----\
static {
//---- start setting l7d fields ----\
ERROR_AuthenticationException = l7dBundle.getString("ERROR_AuthenticationException");
ERROR_cannot_find_algorithm = l7dBundle.getString("ERROR_cannot_find_algorithm");
ERROR_invalid_context = l7dBundle.getString("ERROR_invalid_context");
...many more
//---- end setting l7d fields ----\
}
The approach above offers a few benefits.
Since your string key is now defined as a field, your IDE should support code completion for it. This will save you a lot of type. It get's really frustrating looking up every key name and fixing typos every time you want to print a string.
Someone please correct me if I am wrong. By loading all of the strings into memory at static instantiation (as in the example) will result in a quicker load time at the cost of additional memory usage. I have found the additional amount of memory used is negligible and worth the trade off.
The localised projects I've worked on had 'string freeze' dates. After this time, the only way strings were allowed to be changed was with permission from a very senior member of the project management team.
It isn't exactly a perfect solution, but it did enable us to put defects regarding strings on hold until the next release with a valid reason. Once the string freeze has occured you also have a valid reason to deny adding brand new features to the project on 'spur of the moment' decisions. And having the permission come from high up meant that middle managers would have no power to change specs on your :)
If available, use a database for this. Each string gets an id, and there is either a table for each language, or one table for all with the language in a column (depending on how the site is accessed the performance dictates which is better). This allows updates from translators without trying to manage code files and version control details. Further, it's almost trivial to run reports on what isn't translated, and keep track of what was an autotranslation (engine) vs a real human translation.
If no database, then I stick each language in a separate file so version control issues are reduced. But the structure is basically the same - each string has an id.
-Adam
Not only did we use a database instead of the vaunted resource files (I have never understood why people use something like that which is a pain to manage, when we have such good tools for dealing with databases), but we also avoided the need to tag things in the application (forgetting to tag controls with numbers in VB6 Forms was always a problem) by using reflection to identify the controls for translation. Then we use an XML file which translates the controls to the phrase IDs from the dictionary database.
Although the mapping file had to be managed, it could still be managed independent of the build process, and the translation of the application was actually possible by end-users who had rights in the database.
The solution we came up to so far is having a small application in Excel that reads all the property files, and then shows a matrix with all the translations (languages as headers, keys as rows). It is quite evident what is missing then. This is send to the translators. When it comes back, then the sheet can be processed to generate the same property bundles back again. So far it has eased the pain somewhat, but I wonder what else is around.
This google book - resource file management gives some good tips
You can use Resource File Management software to keep track of strings that have changed and control the workflow to get them translated - otherwise you end up in a mess of freezes and overbearing version control
Some tools that do this sort of thing - no connection an I haven't actually used them, just researching
http://www.sisulizer.com/
http://www.translationzone.com/en/products/
I put in a makefile target that finds all the .properties files and puts them in a zip file to send off to the translators. I offered to send them just diffs, but for some reason they want the whole bundle of files each time. I think they have their own system for tracking just differences, because they charge us based on how many strings have changed from one time to the next. When I get their delivery back, I manually diff all their files with the previous delivery to see if anything unexpected has changed - one time all the PT_BR (Brazillian Portuguese) strings changed, and it turns out they'd used a PT_PT (Portuguese Portuguese) translator for that batch in spite of the order for PT_BR.
In Java, internationalization is accomplished by moving the strings to resource bundles ... the translation process is still long and arduous, but at least it's separated from the process of producing the software, releasing service packs etc. One thing that helps is to have a CI system that repackages everything any time changes are made. We can have a new version tested and out in a matter of minutes whether it's a code change, new language pack or both.
For starters, I'd use default strings in case a translation is missing. For example, the English or Spanish value.
Secondly, you might want to consider a web app or something similar for your translators to use. This requires some resources upfront, but at least you won't need to send files around and it will be obvious for the translators which strings are new, etc.

Resources