Crawl images and their metadata using Nutch and index them into Solr

I want to build a small image-based search engine: I provide an image file and it searches Solr for similar images. I'm using Nutch for the crawling part and for indexing the data into Solr. So far I've changed the Nutch conf files as follows:
Added image/* to mimetype-filter.txt
Removed the image extensions from suffix-urlfilter.txt so that images are not skipped
I also added these fields to the Solr schema.xml:
<field name="name" type="string" indexed="true" stored="true" />
<field name="iso" type="string" indexed="true" stored="true" multiValued="true" />
<field name="iso_string" type="string" indexed="true" stored="true" multiValued="true" />
<field name="aperture" type="double" indexed="true" stored="true" />
<field name="exposure" type="string" indexed="true" stored="true" />
<field name="exposure_time" type="double" indexed="true" stored="true" />
<field name="focal" type="string" indexed="true" stored="true" />
<field name="focal_35" type="string" indexed="true" stored="true" />
<dynamicField name="ignored_*" type="string" indexed="false" stored="false" multiValued="true" />
But when I crawl, no data is indexed into Solr. I couldn't find any documentation or tutorial covering this. I've also gone through some Stack Overflow posts on image crawling with Nutch, but I didn't find them helpful.
Can someone please point me in the right direction? Thanks in advance.

There is no easy or short answer here: parsing images is tricky business, even without the crawling part. On top of what you've already done, you first need to enable the parse-tika plugin (parse-html only deals with HTML documents). Apache Tika is able to extract some metadata from images.
You also need to enable the mimetype-filter plugin; editing its config file is not enough, the plugin itself must be enabled in nutch-site.xml. Once that is done, use the bin/nutch parsechecker <URL> tool on a page that contains some images and check that the image URLs show up in the Outlinks section. Also run parsechecker against an image URL to see what metadata it extracts. After this, run the bin/nutch indexchecker tool against both URLs, check which fields it would index into Solr, and create those fields in your schema accordingly. Keep in mind that Tika may extract different metadata for each format.
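As a rough sketch of the plugin step, enabling both plugins in conf/nutch-site.xml could look like the fragment below. The plugin list in the value is an assumption for illustration: start from the plugin.includes default in your own nutch-default.xml and add parse-tika and mimetype-filter to it rather than copying this list as-is.

```xml
<!-- conf/nutch-site.xml: values here override nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumed plugin list for illustration only. Keep what your install
         already lists and make sure parse-tika and mimetype-filter appear. -->
    <value>protocol-http|urlfilter-(regex|suffix)|parse-(html|tika)|index-(basic|anchor|more)|indexer-solr|mimetype-filter|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

After changing the file, rerunning bin/nutch parsechecker against an image URL is a quick way to confirm that Tika is now handling it.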

Related

KendoPivot and Mondrian XML/A server

I would like to know if anyone has implemented KendoPivot accessing an instance of Mondrian as an XML/A server. In theory this should work, but I'm wondering if there are any compatibility issues.
I tested KendoPivot with Mondrian, and it works well. The one thing to take into account is that you have to name your hierarchies; if you don't, each hierarchy takes the default name (the dimension name). For example, you need to include name="theStore":
<Dimension name="Store">
  <Hierarchy hasAll="true" primaryKey="store_id" name="theStore">
    <Table name="store"/>
    <Level name="Store Country" column="store_country" uniqueMembers="true"/>
    <Level name="Store State" column="store_state" uniqueMembers="true"/>
    .....
    .....

Updating data in an XML file in MarkLogic using the Java API

Say I inserted this XML file into a MarkLogic datastore:
<Providers>
  <Provider>
    <UniqueId>1111</UniqueId>
    <name>John Doe</name>
    <Age>40</Age>
    <Country>MN</Country>
  </Provider>
  <Provider>
    <UniqueId>2222</UniqueId>
    <name>Johny Deep</name>
    <Age>51</Age>
    <Country>NY</Country>
  </Provider>
</Providers>
Now, if I want to update the name to 'Jane Doe' where UniqueId is '1111', how can I achieve this using MarkLogic's Java API?
Goel, it sounds like you need the Patch operation. This allows you to specify a specific part of a document and add to, change, or delete it.
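As a minimal sketch, a patch for the sample document could be written in the XML patch format of the MarkLogic REST API, which the Java API builds on. The select path is an assumption derived from the sample document in the question. In Java, an equivalent patch is typically built with a DocumentPatchBuilder obtained from the document manager and then applied with its patch method; check the Java API documentation for the exact calls.

```xml
<!-- Partial-update descriptor: replaces only the matched name element -->
<rapi:patch xmlns:rapi="http://marklogic.com/rest-api">
  <!-- select is an XPath; it targets the name of the Provider whose UniqueId is 1111 -->
  <rapi:replace select="/Providers/Provider[UniqueId = '1111']/name">
    <name>Jane Doe</name>
  </rapi:replace>
</rapi:patch>
```

Because the patch addresses a single node, the rest of the document, including the second Provider, is left untouched.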

How to declaratively deploy a localized datetime field with validation?

The following column definition DOES work.
<Field ID="{F4313C31-C8DD-4917-98A9-0DE886177758}"
       Type="DateTime"
       Name="ExpirationDate"
       DisplayName="Limited until (if necessary)"
       StaticName="ExpirationDate"
       Group="SomeGroup"
       Required="FALSE"
       Format="DateOnly"
       FriendlyDisplayFormat="Disabled"
       CalType="0">
  <Validation Message="Please select a date in the future.">=[Limited until (if necessary)]>TODAY()</Validation>
</Field>
But of course I do not want to use the display name in my validation formula.
As we move closer to production, every display name will be moved to a resx file anyway.
And with a resx file I end up with the same error I got earlier, when I tried to use the internal field name instead of the display name.
The error is: The formula cannot refer to another column. Check the formula for spelling mistakes or update the formula to reference only this column.
This is what SharePoint does itself when creating a column via the UI:
<Field Type="DateTime" DisplayName="RS Expiration Date" Required="FALSE" EnforceUniqueValues="FALSE" Indexed="FALSE" Format="DateOnly" FriendlyDisplayFormat="Disabled" ID="{15380d60-50d7-4ce1-b21b-92695f0c0811}" SourceID="{8086fd7d-ca0b-4258-9352-f166615b6159}" StaticName="RSExpDate" Name="RSExpDate" ColName="datetime2" RowOrdinal="0" CalType="0" Version="1">
  <Validation Message="Please enter a future date." Script="function(x){return SP.Exp.Calc.valid(SP.Exp.Node.f('GT',[SP.Exp.Node.a(0),SP.Exp.Node.f('TODAY',[])]),x)}">=RSExpDate>TODAY()</Validation>
  <ValidationDisplayNames>=[RS Expiration Date]>TODAY()</ValidationDisplayNames>
</Field>
Obviously there's a lot of information in there you won't need.
The interesting part is the validation. It uses "ValidationDisplayNames" instead of "Validation". Still, the latter also works only with display names.

Fusebox 5.5 noxml folder name problems

I am having trouble with fusebox 5.5 noxml and circuits...
I have a structure that looks like this.
controller
    app.cfc
model
    main
        act_comm_main.cfm
    monkey
        act_something_else.cfm
view
    main
        dsp_comm_main.cfm
    monkey
        dsp_somethingElse.cfm
In the app.cfc file I have this:
<cfcomponent>
    <cffunction name="postfuseaction">
        <cfargument name="myFusebox" />
        <cfargument name="event" />
        <!--- do the layout --->
        <cfset myFusebox.do( action="layout.lay_template" ) />
    </cffunction>
    <cffunction name="main">
        <cfargument name="myFusebox" />
        <cfargument name="event" />
        <!--- do model fuse --->
        <cfset myFusebox.do( action="monkey.act_somethingElse" ) />
        <!--- do model fuse --->
        <cfset myFusebox.do( action="main.act_comm_main" ) />
        <!--- do display fuse and set content variable body --->
        <cfset myFusebox.do( action="main.dsp_comm_main", contentvariable="body" ) />
    </cffunction>
</cfcomponent>
This doesn't work. If I rename the view folder to mainPages and change the cfset myFusebox.do call to point at mainPages.dsp_comm_main, the page comes up. With the structure above, however, I get this error:
undefined Fuseaction
You specified a Fuseaction of dsp_comm_main which is not defined in
Circuit main.
I removed the parsed files and let Fusebox rebuild them, but I still get this error.
So I know I can work around it by naming the directories differently in the model and view folders, but why is this happening, and what can I do to use same-named directories across model and view?
This is because in Fusebox, models and views are just a convention for implementing MVC. Technically they are all just circuits; whether they are explicit or implicit doesn't matter.
A circuit name must be unique within the application, so you have to name the folders differently.
Personally, I've used names like vMain/mMain and vMonkey/mMonkey for more complex apps with many view circuits. For simpler apps it can be enough to have just layout and display view circuits; that way the model circuits can be named without a prefix.

Get a count of a linked collection using OData and LINQ

I set up the OData feed for Stack Overflow as outlined in the wonderful article Using LINQPad to Query Stack Overflow and I want to do something like:
Users.Where(x=>x.Badges.Count==0).Take(5)
to get the users that have no Badges ("Badges? We don't need no stinkin' badges!"). I get a DataServiceQueryException:
Unfortunately, OData doesn't support aggregate functions - it supports only the limited set of querying functions described here.
Aggregate operators
All aggregate operations are unsupported against a DataServiceQuery,
including the following:
Aggregate
Average
Count
LongCount
Max
Min
Sum
Aggregate operations must either be performed on the client or be
encapsulated by a service operation.
Hopefully Microsoft will enhance the OData client in the future - it is frustrating to (seemingly) have all the power of LINQ and then not be able to use it.
Looks like Badges doesn't have a Count property. This is why the exception occurred.
<EntityType Name="Badge">
  <Key>
    <PropertyRef Name="Id" />
  </Key>
  <Property xmlns:p8="http://schemas.microsoft.com/ado/2009/02/edm/annotation" Name="Id" Type="Edm.Int32" Nullable="false" p8:StoreGeneratedPattern="Identity" />
  <Property Name="UserId" Type="Edm.Int32" Nullable="true" />
  <Property Name="Name" Type="Edm.String" Nullable="true" MaxLength="50" Unicode="true" FixedLength="false" />
  <Property Name="Date" Type="Edm.DateTime" Nullable="true" />
  <NavigationProperty Name="User" Relationship="MetaModel.BadgeUser" FromRole="Badge" ToRole="User" />
</EntityType>
Probably you'd need to process each User to check whether the Badges navigation property resolves to anything.
Filtering on count of entities in navigation properties is currently not supported (as already noted by Joe Albahari above). In the latest CTP OData supports any and all functions which would allow you to filter on "empty" navigation properties.
See http://blogs.msdn.com/b/astoriateam/archive/2011/10/13/announcing-wcf-data-services-oct-2011-ctp-for-net-4-and-silverlight-4.aspx to get the latest CTP.
Here is a discussion of the any/all feature: http://www.odata.org/blog/even-more-any-and-all

Resources