Use of Heritrix's HtmlFormCredential and CredentialStore - spring

I am attempting to add authentication to my Heritrix configuration. My .cxml file has the following:
<bean id="preconditions" class="org.archive.crawler.prefetch.PreconditionEnforcer">
<property name="credentialStore">
<ref bean="credentialStore" />
</property>
</bean>
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
<property name="credentialStore">
<ref bean="credentialStore" />
</property>
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<bean class="org.archive.modules.deciderules.recrawl.IdenticalDigestDecideRule">
<property name="decision" value="REJECT" />
</bean>
<bean class="org.archive.modules.deciderules.ResourceNoLongerThanDecideRule">
<property name="contentLengthThreshold" value="54" />
<property name="useHeaderLength" value="true" />
<property name="decision" value="REJECT" />
</bean>
</list>
</property>
</bean>
</property>
</bean>
<bean id="exampleCredential" class="org.archive.modules.credential.HtmlFormCredential">
<property name="domain" value="example.com" />
<property name="loginUri" value="https://example.com/user?destination=%2f" />
<property name="formItems">
<map>
<!-- username/password -->
<entry key="name" value="something@something.com"/>
<entry key="pass" value="genericpassword"/>
<!-- hidden inputs -->
<entry key="form_build_id" value="form-asdf" />
<entry key="form_id" value="user_login" />
<!-- submit -->
<entry key="op" value="submit"/>
</map>
</property>
</bean>
<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
<property name="credentials">
<map>
<entry key="exampleCredential" value-ref="exampleCredential" />
</map>
</property>
</bean>
I also set the logging for FetchHTTP and PreconditionEnforcer to FINE, but nothing seems to be happening. No logging output appears from either module, and the pages that are pulled down are clearly the unauthenticated view.
I find it unclear how to use the CredentialStore, even though I've spent a good amount of time reading through the documentation, which is patchy at best when it comes to authentication.
If anyone knows how to set up authentication in Heritrix, please help.
Update:
Logging didn't work because Eclipse didn't know about my HERITRIX_HOME variable, so it never read the logging configuration file.
I changed the bean exampleCredential's domain property from:
<property name="domain" value="example.com" />
to:
<property name="domain" value="www.example.com" />
and now the login page is enqueued, but now the logger spits out the following for all queued files:
org.archive.crawler.prefetch.PreconditionEnforcer.innerProcessResult() PolitenessEnforcer doesn't understand uri's of type dns (ignoring)
org.archive.modules.deciderules.ResourceNoLongerThanDecideRule.evaluate() Error: Missing HttpMethod object in CrawlURI. dns:secure.www.example.com
and none of the files are downloaded or crawled. So though I made progress, it didn't lead me anywhere. There is not much logging information to go off of.

I also asked this question on the Heritrix forum: http://tech.groups.yahoo.com/group/archive-crawler/message/8235 and Noah Levitt suggested adding the login page as a seed to my crawl. Everything now seems to be working without much issue.
My conclusion is that I had everything set up correctly in my config file, but was missing the actual page seed that I needed.
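The fix described above, adding the login page as a seed, could be expressed in the .cxml roughly as follows. This is only a sketch: it assumes the stock Heritrix 3 seeds bean (a TextSeedModule with a ConfigString text source) as found in the default crawler-beans.cxml. The first seed URL is a placeholder, and the second is the loginUri from the credential bean above.

```xml
<!-- Assumption: the default Heritrix 3 "seeds" bean layout; adjust if your profile defines seeds differently -->
<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
  <property name="textSource">
    <bean class="org.archive.spring.ConfigString">
      <property name="value">
        <value>
# regular starting point of the crawl (placeholder)
http://www.example.com/
# login page, listed as a seed so the form credential can be applied
https://example.com/user?destination=%2f
        </value>
      </property>
    </bean>
  </property>
</bean>
```

If your profile sets seeds through a properties override instead, the same two URLs would go there.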

Related

Creating multiple SchedulerFactoryBean in Quartz

I have run into a problem where I have two classes extending QuartzJobBean and need to create two SchedulerFactoryBean instances. I did my research and found that setting the schedulerName property is supposed to solve the problem, but this did not work for me. I also tried @Qualifier. If I create two SchedulerFactoryBean beans, Spring doesn't know which scheduler to refer to. I have two CronTriggers.
Code:
<!--
<bean name="quartzSchedulerR" class="org.springframework.scheduling.quartz.JobDetailBean">
<property name="jobClass" value="com.task.QuartzScheduler" />
<property name="jobDataAsMap">
<map>
<entry key="rRSImpl" value-ref="rRSService" />
<entry key="SRObject" value-ref="SRObject"/>
</map>
</property>
</bean>
<bean id="cronTriggerR"
class="org.springframework.scheduling.quartz.CronTriggerBean">
<property name="jobDetail" ref="quartzSchedulerR" />
<property name="cronExpression" value="0 30 12 ? * MON *" />
</bean>
<bean id="quartzScheduler" class="org.springframework.scheduling.quartz.SchedulerFactoryBean" lazy-init="false">
<property name="schedulerName" value="scheduleOne"/>
<property name="schedulerContextAsMap">
<map>
<entry key="rSchedulerServiceImpl" value-ref="rSchedulerServiceImpl"></entry>
</map>
</property>
<property name="jobDetails">
<list>
<ref bean="quartzSchedulerR" />
</list>
</property>
<property name="triggers">
<list>
<ref bean="cronTriggerR" />
</list>
</property>
</bean> -->
<bean id ="quartzScheduler" class="org.springframework.scheduling.quartz.SchedulerFactoryBean" lazy-init="false">
<property name="schedulerContextAsMap">
<map>
<entry key="rSSImpl" value-ref="rSSImpl"></entry>
</map>
</property>
</bean>
<bean id="jobDetailFactory" class="org.springframework.beans.factory.config.ObjectFactoryCreatingFactoryBean">
<property name="targetBeanName">
<idref local="jobDetail" />
</property>
</bean>
<bean id="jobDetail" class="org.springframework.scheduling.quartz.JobDetailBean" scope="prototype">
<property name="jobClass" value="com.scheduler.SMTPMailJob" />
<property name="jobDataAsMap">
<map>
<entry key="rSSeImpl" value-ref="rSSImpl" />
<entry key="fUtil" value-ref="fUtil" />
<entry key="rService" value-ref="rService" />
<entry key="fusion" value-ref="fusion"/>
<entry key="fcproperties" value-ref="fcproperties"/>
</map>
</property>
</bean>
<bean id="jobTriggerFactory"
class="org.springframework.beans.factory.config.ObjectFactoryCreatingFactoryBean">
<property name="targetBeanName">
<idref local="jobTrigger" />
</property>
</bean>
<bean id="jobTrigger" class="org.springframework.scheduling.quartz.CronTriggerBean"
scope="prototype">
</bean>
Currently I comment out the first scheduler and the application works as expected. But if I uncomment it, the second scheduler stops working. Is there any workaround for this issue? Any help is appreciated.
EDIT: No error is logged, but I know that the job isn't scheduled. The underlying problem is that Spring doesn't find a unique bean of type org.quartz.Scheduler when both SchedulerFactoryBeans are defined. Basically, how do I configure multiple SchedulerFactoryBeans for totally isolated classes?
Can you explain why you need two SchedulerFactoryBeans?
According to the Spring documentation, SchedulerFactoryBean is a FactoryBean that creates and configures a Quartz Scheduler, manages its lifecycle as part of the Spring application context, and exposes the Scheduler as a bean reference for dependency injection.
This means you can't have multiple instances of the bean.
Also, judging by your code, you are trying to schedule multiple jobs, which can be done with one SchedulerFactoryBean. Create multiple job beans and their corresponding triggers, and add them as lists to the SchedulerFactoryBean. All the triggers will fire as configured, whether they are cron triggers or simple triggers.
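The single-scheduler approach described above could look roughly like this. It is a minimal sketch that reuses the Spring 3.x classes from the question (JobDetailBean, CronTriggerBean). The job classes com.example.FirstJob and com.example.SecondJob, the bean ids, and the second cron expression are hypothetical.

```xml
<!-- Two independent jobs; both job classes are hypothetical QuartzJobBean subclasses -->
<bean id="firstJob" class="org.springframework.scheduling.quartz.JobDetailBean">
  <property name="jobClass" value="com.example.FirstJob" />
</bean>
<bean id="secondJob" class="org.springframework.scheduling.quartz.JobDetailBean">
  <property name="jobClass" value="com.example.SecondJob" />
</bean>

<!-- One cron trigger per job -->
<bean id="firstTrigger" class="org.springframework.scheduling.quartz.CronTriggerBean">
  <property name="jobDetail" ref="firstJob" />
  <property name="cronExpression" value="0 30 12 ? * MON *" />
</bean>
<bean id="secondTrigger" class="org.springframework.scheduling.quartz.CronTriggerBean">
  <property name="jobDetail" ref="secondJob" />
  <property name="cronExpression" value="0 0 6 * * ?" />
</bean>

<!-- A single scheduler that owns both triggers -->
<bean id="quartzScheduler" class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
  <property name="triggers">
    <list>
      <ref bean="firstTrigger" />
      <ref bean="secondTrigger" />
    </list>
  </property>
</bean>
```

With only one SchedulerFactoryBean in the context, injecting the scheduler by type is no longer ambiguous.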

Spring-Hibernate Caching with Memcached

I have an application whose back end has been made with Spring and Hibernate.
I want to add Memcached caching to make the application more scalable. At first I thought I could integrate Hibernate's second-level cache with Memcached, but the problem that arose was that all the HQL in the application uses paths like book.grade.id, where Book and Grade are two separate entities, so the second-level cache mechanism failed.
Can anyone recommend a way to implement caching? I have had a look at EHCache, but I want a Memcached implementation for now. My application will be hit by several servers, but only one database server will exist. Given these conditions, any recommendations?
Below are the steps you can follow.
Change pom.xml to include the cache abstraction for Memcached and a client implementation using xmemcached.
<dependency>
<groupId>com.google.code.simple-spring-memcached</groupId>
<artifactId>spring-cache</artifactId>
<version>3.1.0</version>
</dependency>
<dependency>
<groupId>com.google.code.simple-spring-memcached</groupId>
<artifactId>xmemcached-provider</artifactId>
<version>3.1.0</version>
</dependency>
Note: you also need to include cglib, as this is AOP based.
configuration.xml file changes
Defining beans:
<bean name="cacheManager" class="com.google.code.ssm.spring.SSMCacheManager">
<property name="caches">
<set>
<bean class="com.google.code.ssm.spring.SSMCache">
<constructor-arg name="cache" index="0" ref="defaultCache"/>
<!-- 5 minutes -->
<constructor-arg name="expiration" index="1" value="300"/>
<!-- @CacheEvict(..., "allEntries" = true) doesn't work -->
<constructor-arg name="allowClear" index="2" value="false"/>
</bean>
</set>
</property>
</bean>
<bean name="defaultCache" class="com.google.code.ssm.CacheFactory">
<property name="cacheName" value="defaultCache" />
<property name="cacheClientFactory">
<bean name="cacheClientFactory"
class="com.google.code.ssm.providers.xmemcached.MemcacheClientFactoryImpl" />
</property>
<property name="addressProvider">
<bean class="com.google.code.ssm.config.DefaultAddressProvider">
<property name="address" value="x.x.x.x:11211" />
</bean>
</property>
<property name="configuration">
<bean class="com.google.code.ssm.providers.CacheConfiguration">
<property name="consistentHashing" value="true" />
</bean>
</property>
</bean>
Sample method...
@Cacheable(value="defaultCache", key="new Integer(#id).toString().concat('.BOOKVO')")
public BookVO getBookById(Integer id){
...
}
With these changes, your method will hit the database only if the key is not found on the Memcached server.
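For the @Cacheable annotation on the sample method to take effect, Spring's annotation-driven caching normally has to be switched on in the same application context. A minimal sketch, assuming Spring 3.1 or later and the cacheManager bean name used in the configuration above:

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:cache="http://www.springframework.org/schema/cache"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/cache http://www.springframework.org/schema/cache/spring-cache.xsd">

  <!-- Enables processing of @Cacheable and @CacheEvict; the attribute defaults to "cacheManager" -->
  <cache:annotation-driven cache-manager="cacheManager" />

  <!-- the cacheManager and defaultCache bean definitions go here -->
</beans>
```

The annotated method must be called through a Spring-managed bean, since the caching is applied by a proxy.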

Heritrix: how to exclude everything but pdf from mirroring?

I found this topic: How do I exclude everything but text/html from a Heritrix crawl?
I have changed the bean to this:
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="ACCEPT" />
<property name="regex" value="^application/pdf.*"/>
</bean>
</property>
</bean>
But Heritrix still saves every file to the mirror directory.
I believe you are missing a reject rule above your accept rule. I have the following that works:
<property name="shouldProcessRule">
<bean class="org.archive.modules.deciderules.DecideRuleSequence">
<property name="rules">
<list>
<bean class="org.archive.modules.deciderules.RejectDecideRule">
</bean>
<bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
<property name="decision" value="ACCEPT" />
<property name="regex" value="^application/pdf.*"/>
</bean>
</list>
</property>
</bean>
</property>
This rejects everything, then accepts everything listed in the following rules.

How to clear methodcacheinterceptor cache

I am developing a web application in which I wrote a method that calls a JSON URL, parses the response, and stores it in a POJO class.
I cache this method using a MethodCacheInterceptor. My XML configuration is below:
<bean id="methodCacheInterceptor"
class="web.app.services.MethodCacheInterceptor">
<property name="cache">
<ref local="methodCache" />
</property>
</bean>
<bean id="methodCache" class="org.springframework.cache.ehcache.EhCacheFactoryBean">
<property name="cacheManager">
<ref local="cacheManager" />
</property>
<property name="cacheName">
<value>videoItemsCache</value>
</property>
</bean>
<bean id="cacheManager"
class="org.springframework.cache.ehcache.EhCacheManagerFactoryBean">
<property name="configLocation">
<value>/WEB-INF/ehcache.xml</value>
</property>
</bean>
<bean id="methodCachePointCut"
class="org.springframework.aop.support.RegexpMethodPointcutAdvisor">
<property name="advice">
<ref local="methodCacheInterceptor" />
</property>
<property name="patterns">
<list>
<value>.*web.app.services.ItemCollectionImpl.getCollection</value>
<value>.*web.app.services.ItemCollectionImpl.buildCollection</value>
</list>
</property>
</bean>
I set a cache time of 20 minutes and it works perfectly.
For one particular activity, I want to override this and clear the cache, so that the next time I visit the page the whole method is executed and the data is not loaded from the cache.
FYI:
CacheManager.getInstance().getCacheNames() prints the cache name I gave in the XML configuration.
I have tried CacheManager.getInstance().clearAll() and the removeAll methods, but they don't work.
Can anyone help me here?
Thanks in advance.

Loading .properties file in a jar from my web-app

I have created a JAR that I need to use in my web application. Both are built with the Spring framework. I would like to load a .properties file that lives outside the JAR, in the main context of the web application, using the facilities that Spring offers.
I've tried to do something like this in my spring.xml file inside the JAR:
<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
<property name="location">
<value>/WEB-INF/classes/my.properties</value>
</property>
</bean>
<bean class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
<property name="triggers">
<bean id="cronTrigger" class="org.springframework.scheduling.quartz.CronTriggerBean">
<property name="jobDetail" ref="myJob" />
<property name="cronExpression" value="${my.cronExpression}"/>
</bean>
</property>
</bean>
I am trying to load my.cronExpression from the my.properties file, but without any success.
I always get this error:
Could not resolve placeholder 'my.cronExpression'.
I've tried many variants of the location, such as classpath:/WEB-INF/classes/my.properties,
but I'm not able to load the configuration file.
Thanks for your help.
Use classpath:my.properties, because /WEB-INF/classes is the root of your classpath.
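Applied to the configuration from the question, the classpath: suggestion would look roughly like this. It is a sketch that assumes my.properties ends up in the web application's /WEB-INF/classes directory:

```xml
<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
  <!-- Resolved against the classpath root, which includes /WEB-INF/classes in a web application -->
  <property name="location" value="classpath:my.properties" />
</bean>
```

Because the JAR and the web application share one classpath at runtime, the same location string works from the spring.xml packaged inside the JAR.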
Try declaring it as follows:
<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
<property name="location">
<value>/WEB-INF/classes/my.properties</value>
</property>
<property name="ignoreUnresolvablePlaceholders">
<value>true</value>
</property>
</bean>
I have gone through your code and suggest trying this snippet.
It works well for me:
<bean id="placeholderProperties"
class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
<property name="location" value="file:/WEB-INF/classes/my.properties" />
<property name="systemPropertiesModeName" value="SYSTEM_PROPERTIES_MODE_OVERRIDE" />
<property name="ignoreUnresolvablePlaceholders" value="true" />
<property name="order" value="1" />
</bean>

Resources