Apache Tika server request to get 'main content' instead of 'plain text' - ruby

I am experimenting with Apache Tika: app & server, gui and command line.
With Tika app, I can do something like
java -jar tika-app-1.7.jar --gui
and choose 'View' -> 'Main content', or
java -jar tika-app-1.7.jar --text-main http://www.cnn.com/2015/07/09/politics/russian-bombers-u-s-intercept-july-4/index.html
I need the main content, but in server mode it seems I can only get plain text. I am checking this guide.
curl -s "http://amzn.com/B005IWM8PU" | curl -X PUT -T - http://<server_ip>:9998/meta
curl -s "http://amzn.com/B005IWM8PU" | curl -X PUT -T - http://<server_ip>:9998/tika
Maybe something that comes after http://<server_ip>:9998/ will do the trick?
Is there any way to get the main content in server mode?
In the end, the request has to be made from Ruby against tika-server-1.3.jar. So far it looks like this:
require "net/http"
tika_prefix = URI('http://<server_ip>:9998/tika')
url = 'http://www.cnn.com/2015/07/09/politics/russian-bombers-u-s-intercept-july-4/index.html'
request = Net::HTTP::Put.new(tika_prefix.to_s)
request.body = url
request.content_type = 'text/html'
http = Net::HTTP.start(tika_prefix.hostname, tika_prefix.port)
http.request(request).body

This is possible as of Tika 1.15, which implements the TIKA-2343 feature request and adds an equivalent of --text-main to server mode.
vaites/php-apache-tika is a PHP binding for Tika that I use, and I've opened an issue regarding this, so we should see it implemented there soon.
EDIT: The PHP Binding library now supports this feature.
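For illustration, here is a minimal sketch of what such a server request might look like. It assumes the main-content extractor is exposed at /tika/main (how TIKA-2343 is understood here; check the endpoint list of your server version) and that the body is the page's HTML rather than its URL. It is written in Python with only the standard library; the same PUT can be issued from Ruby's Net::HTTP. The helper name build_main_content_request and the host tika.example are hypothetical.

```python
from urllib.request import Request


def build_main_content_request(server, html_bytes):
    """Build (but do not send) a PUT request for Tika server's main-content endpoint."""
    # Assumption: Tika server 1.15+ serves boilerplate-stripped text at /tika/main.
    return Request(
        url=server.rstrip("/") + "/tika/main",
        data=html_bytes,          # the HTML document itself, not its URL
        method="PUT",
        headers={"Content-Type": "text/html", "Accept": "text/plain"},
    )


# Hypothetical server address and a tiny stand-in document.
req = build_main_content_request(
    "http://tika.example:9998",
    b"<html><body><p>Hello</p></body></html>",
)
print(req.get_method(), req.full_url)
```

To actually send it, pass the request to urllib.request.urlopen(req) and decode the response body as UTF-8.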


How can I download a maven package from GitLab with curl or wget?

I have a Maven package (dummy) in my GitLab Package Registry that I want to download with a curl or wget command.
Following this I tried:
curl --user "username:DEPLOY_TOKEN" \
"https://gitlab.com/api/v4/projects/666/packages/maven/dummy/0.0.1-SNAPSHOT/dummy-0.0.1-SNAPSHOT.jar"
but I get:
{"message":"404 Project Not Found"}
The project id is correct.
How can I download the maven package?
To download any package, including a maven one, you will need to use the Packages API.
Following those docs, you need to use:
curl --header "PRIVATE-TOKEN: <your_access_token>" "https://gitlab.example.com/api/v4/projects/:id/packages/:package_id"
Assuming the 666 in the description is the project ID, then it'd be:
curl --header "PRIVATE-TOKEN: <your_access_token>" "https://gitlab.com/api/v4/projects/666/packages/:package_id"
but you would still need to figure out the package id.
If you don't know the package id, you can use the packages API to list the available packages in the project first.
The endpoint you're using looks like it's from the Maven API documentation page, which specifically states it's not meant for manual consumption, so it's not the recommended method.
If you need to use that endpoint anyway, you need to follow the package registry authentication documentation (as per the note at the top of that page).
This means that if you want to use a deploy token, you need to make sure your deploy token has read_package_registry, write_package_registry, or both.
Your curl command would then look like this:
curl --header "Deploy-Token: <token>" "https://gitlab.com/api/v4/projects/666/packages/maven/dummy/0.0.1-SNAPSHOT/dummy-0.0.1-SNAPSHOT.jar"
Your download script using curl should look like this:
GLB_PRIVATE_TOKEN="<private-token>"
GLB_GROUP_PJT_ID="<numeric project id>"
MAVEN_GROUP_ID="<maven-group-id, with . replaced by />"
MAVEN_ARTIFACT_ID="<maven-artifact-id>"
MAVEN_ARTIFACT_VERSION="<maven-artifact-version>"
GLB_ARTIFACT_FILE_NAME="<maven-artifact-version w/o SNAPSHOT>-<file-specific-number as found in gitlab>"
FILE_TYPE=".jar"
echo "Running curl for $MAVEN_ARTIFACT_ID-$GLB_ARTIFACT_FILE_NAME"
curl --header "Private-Token: $GLB_PRIVATE_TOKEN" "https://gitlab.com/api/v4/projects/$GLB_GROUP_PJT_ID/packages/maven/$MAVEN_GROUP_ID/$MAVEN_ARTIFACT_ID/$MAVEN_ARTIFACT_VERSION/$MAVEN_ARTIFACT_ID-$GLB_ARTIFACT_FILE_NAME$FILE_TYPE" --output "<filename>.<filetype>"
In your case GLB_ARTIFACT_FILE_NAME would be 0.0.1-<some hyphenated number>, so the file requested is dummy-0.0.1-<some hyphenated number>.jar.
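As a rough illustration of how the pieces of that URL fit together, the sketch below composes the Maven-endpoint path from the group id, artifact id, version and file name. The function name maven_package_url and the group id com.example are hypothetical (the question does not state the group id), and the snippet only builds the URL; it does not contact GitLab.

```python
from urllib.parse import quote


def maven_package_url(base, project_id, group_id, artifact_id, version, file_name):
    """Compose the Maven-endpoint download URL for one file of a package."""
    group_path = group_id.replace(".", "/")  # com.example -> com/example
    parts = [group_path, artifact_id, version, file_name]
    # Percent-encode each segment but keep the slashes inside the group path.
    path = "/".join(quote(part, safe="/") for part in parts)
    return f"{base.rstrip('/')}/api/v4/projects/{project_id}/packages/maven/{path}"


# Hypothetical group id; the other values come from the question.
url = maven_package_url(
    "https://gitlab.com", 666, "com.example", "dummy",
    "0.0.1-SNAPSHOT", "dummy-0.0.1-SNAPSHOT.jar",
)
print(url)
```

Note that the group path sits between /packages/maven/ and the artifact id, which the URL in the question leaves out.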

Need a 'text' parameter to parse duckling rasa x

I am trying to run Duckling locally. With the help of this article I installed stack, and then:
cloned the Duckling code
git clone https://github.com/facebook/duckling.git
downloaded the zoneinfo and updated the reference in exe/ExampleMain.hs
let defaultPath = "duckling/exe/zoneinfo/"
let fallbackPath = "exe/zoneinfo/"
built it using
stack build
then ran it using
stack exec duckling-example-exe
Now if I send a POST request to http://localhost:8000/parse from Postman with the following content
{
"text": "tommorow",
"locale": "de_DE",
"tz": "Europe/Berlin",
"dims": [
"time"
],
"reftime": 1616571265000
}
it returns 422 bad input with the message
Need a 'text' parameter to parse
and if I send the same request again it returns 200 OK with
quack!
Any help?
I see that you are trying to send the request as JSON; however, the http://localhost:8000/parse endpoint expects the input to be sent as form-encoded data.
Refer to this image for a sample snapshot - https://i.stack.imgur.com/Cqdz4.png
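To make the difference concrete, here is a small sketch that turns a payload like the one in the question into a form-encoded body. It assumes that list-valued fields such as dims are sent as a JSON-encoded string inside the form field, which is an assumption to verify against your Duckling build; the helper name to_form_body is hypothetical.

```python
import json
from urllib.parse import urlencode, parse_qs

payload = {
    "text": "tomorrow",
    "locale": "de_DE",
    "tz": "Europe/Berlin",
    "dims": ["time"],
    "reftime": 1616571265000,
}


def to_form_body(fields):
    """Encode fields as application/x-www-form-urlencoded text."""
    # Assumption: lists and dicts travel as JSON strings inside a form field.
    flat = {
        key: json.dumps(value) if isinstance(value, (list, dict)) else str(value)
        for key, value in fields.items()
    }
    return urlencode(flat)


body = to_form_body(payload)
print(body)
```

The resulting string is what goes into the request body, with the Content-Type header set to application/x-www-form-urlencoded instead of application/json.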
You can check the source code of Rasa Open Source. It uses the requests Python library to call Duckling for data parsing.
Here is the source code.
It is useful for seeing the correct format of the text data.
I will also show how to use Duckling through a simple example.
Make sure that you compile and run the binary:
$ stack build
$ stack exec duckling-example-exe
Inside a Python environment, or any IDE that supports Python, run the following:
import requests
t = requests.post('http://0.0.0.0:8000/parse', data={'text':'tomorrow at eight', 'locale':'en_GB'})
print(t.text)
The output is
[{"body":"tomorrow at eight","start":0,"value":{"values":[{"value":"2021-09-27T08:00:00.000-07:00","grain":"hour","type":"value"},{"value":"2021-09-27T20:00:00.000-07:00","grain":"hour","type":"value"}],"value":"2021-09-27T08:00:00.000-07:00","grain":"hour","type":"value"},"end":17,"dim":"time","latent":false}]

nexus configure initial repositories non-interactively

I would like to create a Docker image for our Nexus instance with the correct repositories, proxies etc. already created.
Inspired by this question I started using the script API to configure my repositories. The repositories configured through this API don't work like the ones configured manually though (how sad, especially if you imagine the trouble I went through to get the configuration done with the undocumented script API...). I have already filed a bug, in case you want to know the details: https://issues.sonatype.org/browse/NEXUS-19891
Now my question: is there another way to configure the repositories non-interactively?
For jenkins it is possible to put some default configuration in /usr/share/jenkins/ref which will then be used only at the first startup; to give you an initial configuration. I was wondering if something similar exists for nexus? Or some other way that I don't know about?
I use python to do something similar to this:
curl -X POST -u admin:admin123 --header 'Content-Type: application/json' http://localhost:8081/service/rest/v1/script -d '{"name":"test","type":"groovy","content":"repository.createYumProxy('\''test'\'', '\''http://repository:8080/'\'')"}'
curl -X POST -u admin:admin123 --header "Content-Type: text/plain" 'http://127.0.0.1:8081/service/rest/v1/script/test/run'
The exact script that I post (more readable here than with all those escaped quotes):
repository.createYumProxy('{name}', '{url}');
configuration = repository.repositoryManager.get('{name}').configuration.copy();
configuration.attributes['proxy'] = [
    remoteUrl : "{url}",
    contentMaxAge : 0,
    metadataMaxAge : 0
]
configuration.attributes['negativeCache'] = [
    timeToLive : 1.0
]
repository.repositoryManager.update(configuration)
The part that was missing in my case was the repositoryManager.update(). As quoted on the ticket:
I think the important item(s) missing from your script is that you are not updating the repositoryManager with the new (copied) configuration (which causes the repository to stop/start and therefore reload config)
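Since the answer mentions doing this from Python, here is a minimal sketch of how the JSON body for the script upload could be built without hand-escaping quotes. It only constructs the payload for the /service/rest/v1/script call shown above and sends nothing; the helper name script_payload is hypothetical.

```python
import json


def script_payload(script_name, groovy_source):
    """Return the JSON body for uploading a Groovy script to the script API."""
    # json.dumps takes care of quoting, so the Groovy source can keep its own quotes.
    return json.dumps(
        {"name": script_name, "type": "groovy", "content": groovy_source}
    )


# Same one-line script as in the curl example above.
groovy = "repository.createYumProxy('test', 'http://repository:8080/')"
body = script_payload("test", groovy)
print(body)
```

The multi-line script with the repositoryManager.update() call can be passed the same way as groovy_source.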

create project for sonarqube with the rest-api / web-api

We are trying to automate the creation of projects (including user/group management) in SonarQube, and I already found the Web API documentation in our SonarQube 5.6 installation. But if I try to create a project with the following settings
JSON file create-project.json:
{"key": "test1", "name": "Testprojekt1"}
curl request:
curl --noproxy '*' -D -X POST -k -u admin:admin -H 'content-type: application/json' -d create_project.json http://localhost:9000/api/projects/create
I get the error:
{"err_code":400,"err_msg":"Missing parameter: key"}
It's a bit strange because if I try e.g. the URL:
http://localhost:9000/api/projects/index
I get the list of the projects I created manually, and if I try a request like
curl -u admin:admin -X POST 'http://localhost:9000/api/projects/create?key=myKey&name=myProject'
it works too, but I would like to use the new API because it looks like it supports many more functions than the 4.x API of SonarQube.
Maybe someone here can help me with this problem. I would be very thankful for every useful hint.
best regards
Dan
I found this question because I got the same "parameter missing" error message.
So what we both did not understand: the SonarQube API expects the parameters as plain URL parameters, not as JSON-formatted parameters as most REST APIs do today.
PS: Would be nice if this could be added to the SQ documentation.
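As an illustration of passing the parameters in the URL, the sketch below builds the create-project request with the key and name from the question encoded as query parameters. It is Python with only the standard library, it does not send anything, and it leaves out authentication; the helper name create_project_request is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def create_project_request(base_url, key, name):
    """Build (but do not send) a POST with the parameters in the query string."""
    params = urlencode({"key": key, "name": name})
    return Request(
        f"{base_url.rstrip('/')}/api/projects/create?{params}",
        method="POST",
    )


req = create_project_request("http://localhost:9000", "test1", "Testprojekt1")
print(req.get_method(), req.full_url)
```

This mirrors the curl call with ?key=...&name=... that the question reports as working.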

adding 'curl' command to sinatra and rails?

(GAVE UP ON INSTALLING CURB. POSTED NEW QUESTION PER SUGGESTION OF ONE OF THE RESPONDENTS)
I thought 'curl' was built in, but I got an undefined method error in a Sinatra app. Is there a gem I need to add?
Same question for Rails 3.
The use case is that I simply have to 'hit' an external URL (http://kickstartme.someplace.com?action=ACTIONNAME&token=XYZXYZXYZ) to kickstart a remote process.
The external URL returns XML describing success/failure in the format:
<session>
<success>true</success>
<token>xyzxyzxyz</token>
<id>abcabcabc</id>
</session>
So really, all I need is for my Rails and Sinatra apps to hit that URL, parse whatever is returned, and gracefully handle the remote server failing to reply.
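require 'open-uri'
require 'nokogiri'
# URI.open needs Ruby 2.5+; on Ruby 3.0+ plain Kernel#open no longer accepts URLs
response = URI.open("http://kickstartme.someplace.com?action=ACTIONNAME&token=XYZXYZXYZ").read
# Parse the returned XML into a Nokogiri document
doc = Nokogiri::XML(response)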
Use curb, a Ruby binding to libcurl. You will get all the curl features without having to shell out with system.
curl -b "auth=abcdef; ASP.NET_SessionId=lotsatext;" example.com
turns into
curl = Curl::Easy.new('http://example.com/')
curl.cookies = 'auth=abcdef; ASP.NET_SessionId=big-wall-of-text;'
curl.perform
More curb examples
