Modifying the content of a ingested pdf - elasticsearch

I have a pipeline created in elasticsearch that ingests a document with an array of pdfs. I want to modify the content field in order to concatenate other fields at the end for searching purposes.
My pipeline is:
client.ingest.putPipeline({
id: 'my-pipeline-id',
body: {
"description" : "Extract attachment information",
"processors" :
[
{
"foreach": {
"field": "attachments",
"processor": {
"attachment": {
"target_field": "_ingest._value.attachment",
"field": "_ingest._value.data"
}
}
}
}
]
}
}, callback);
I can't add a set processor after the foreach, because I need to access the content of every pdf in order to put the values of that document at the end of the content.
Some example documents are:
let doc = {
matricula: '6789AAA',
bastidor: 'BASTIDOR789',
expediente: '79',
attachments:
[
{
filename: "informe",
data: /* chunk of data in base64 */
},
{
filename: "ivtm_diba",
data: /* another chunk of data in base64 */
}
]
};
The result document will look like this:
{
"_index": "doc",
"_type": "document",
"_id": "AVsy85rwMuPe74hQBT8L",
"_score": 1.2039728,
"_source": {
"attachments": [
{
"filename": "informe",
"attachment": {
"Very very long content",
"date": "2016-06-08T14:01:25Z",
"content_type": "application/pdf",
"language": "es",
"content_length": 3124
}
},
{
"filename": "ivtm_diba",
"attachment": {
"content": "Very long content here",
"content_type": "application/pdf",
"language": "ca",
"content_length": 5657
}
}
],
"expediente": "79",
"matricula": "6789ZXC",
"bastidor": "BASTIDOR789"
}
}
And I want to add to the content field the values of "bastidor", "matricula" and "expediente" fields.
I'm using elasticsearch-js but that's not a requirement.

The elasticsearch _all field can be used in most cases like this.

Related

Is it possible to set new field value when analyzing document being indexed in Elasticsearch?

For example:
when indexing one document into elasticsearch;
i want to analyze a field named description in the document by uax_url_email tokenizer/analyzer;
if description does have any url, put the url into another field named urls array;
finish index this document;
Now i can check whether field urls is empty to know whether description has any url.
Is this possible? Or does analyzer only contributes to the inverted index, not other fields?
You can use Ingest Pipeline Script processor with painless script. I hope this will help you.
POST _ingest/pipeline/_simulate?verbose
{
"pipeline": {
"processors": [
{
"script": {
"description": "Extract 'tags' from 'env' field",
"lang": "painless",
"source": """
def m = /(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])/.matcher(ctx["content"]);
ArrayList urls = new ArrayList();
while(m.find())
{
urls.add(m.group());
}
ctx['urls'] = urls;
""",
"params": {
"delimiter": "-",
"position": 1
}
}
}
]
},
"docs": [
{
"_source": {
"content": "My name is Sagar patel and i visit https://apple.com and https://google.com"
}
}
]
}
Above Pipeline will generate result like below:
{
"docs": [
{
"processor_results": [
{
"processor_type": "script",
"status": "success",
"description": "Extract 'tags' from 'env' field",
"doc": {
"_index": "_index",
"_id": "_id",
"_source": {
"urls": [
"https://apple.com",
"https://google.com"
],
"content": "My name is Sagar patel and i visit https://apple.com and https://google.com"
},
"_ingest": {
"pipeline": "_simulate_pipeline",
"timestamp": "2022-07-13T12:45:00.3655307Z"
}
}
}
]
}
]
}

couchbase cbimport key generation with key in nest json

I have the following json document to be uploaded to a full text search index
{
"bank_id": {
"country_code": "AT",
"bank_code": "ASPKAT2LXXX",
"bank_code_type": "SWIFT_CODE"
},
"institution": {
"name": "Allgemeine Sparkasse Oberoesterreich Bankaktiengesellschaft"
},
"address": {
"address_line_1": "Promenade 11-13",
"address_line_2": "Linz",
"country_code": "AT"
},
"bank_capabilities": ["INSTANT_CREDIT"],
"payment_network_details": [{
"network": "REAL_TIME_PAYMENTS",
"capabilities": ["INSTANT_CREDIT"]
}]
}
And I want to generate key with country_code and bank_code. How can i accomplish this?
tried with -g %bank_id.country_code%::%bank_id.bank_code% and it is not working.

Apollo readQuery Fails Even Though Target Object is Present?

I'm working on a call to readQuery. I'm getting an error message:
modules.js?hash=2d0033b4773d9cb6f118946043f7a3d4385825fe:25847
Error: Can't find field resolutions({"id":"Resolution:DHSzPa8bvPCDjuAac"})
on object (ROOT_QUERY) {
"resolutions": [
{
"type": "id",
"id": "Resolution:AepgCCio9KWGkwyMC",
"generated": false
},
{
"type": "id",
"id": "Resolution:DHSzPa8bvPCDjuAac", // <==ID I'M SEEKING
"generated": false
}
],
"user": {
"type": "id",
"id": "User:WWv57KsvqWeAoBNHY",
"generated": false
}
}.
The object with that id appears to be plainly visible as the second entry in the list of resolutions.
Here's my query:
const GET_CURRENT_RESOLUTION_AND_GOALS = gql`
query Resolutions($id: String!) {
resolutions(id: $id) {
_id
name
completed
goals {
_id
name
completed
}
}
}
`;
...and here's how I'm calling it:
<Mutation
mutation={CREATE_GOAL}
update={(cache, {data: {createGoal}}) => {
let id = 'Resolution:' + resolutionId;
const {resolutions} = cache.readQuery({
query: GET_CURRENT_RESOLUTION_AND_GOALS,
variables: {
id
},
});
}}
>
What am I missing?
Update
Per the GraphQL Dev Tools extension for Chrome, here's the whole GraphQL data store:
{
"data": {
"resolutions": [
{
"_id": "AepgCCio9KWGkwyMC",
"name": "testing 123",
"completed": false,
"goals": [
{
"_id": "TXq4nvukpLcqQhMRL",
"name": "test goal abc",
"completed": false,
"__typename": "Goal"
},
],
"__typename": "Resolution"
},
{
"_id": "DHSzPa8bvPCDjuAac",
"name": "testing 345",
"completed": false,
"goals": [
{
"_id": "PEkg5oEEi2tJ6i8LH",
"name": "goal abc",
"completed": false,
"__typename": "Goal"
},
{
"_id": "X4H4dFzGm5gkq5bPE",
"name": "goal bcd",
"completed": false,
"__typename": "Goal"
},
{
"_id": "hYunrXsMq7Gme7Xck",
"name": "goal cde",
"completed": false,
"__typename": "Goal"
}
"__typename": "Resolution"
}
],
"user": {
"_id": "WWv57KsvqWeAoBNHY",
"__typename": "User"
}
}
}
Posted as answer for fellow apollo users with similar problems:
Remove the prefix of Resolution:, the query should only take the id.
Then the question arises how is your datastore filled?
To read a query from cache, the query needs to have been called with exactly the same arguments on the remote API before. This way apollo knows what the result for a field is with specific arguments. If you never called the remote endpoint with the arguments you want to use but know what the result would be, you can circumvent that and resolve the query locally by implementing a cache resolver. Have a look at the example in the documentation. Here the store contains a list of books (in your case resultions) and the query for a single book by id can be resolved with a simple cache lookup.

Children are not mapping properly in elastic to parents

"chods": {
"mappings": {
"chod": {
"properties": {
"state": {
"type": "text"
}
}
},
"chods": {},
"variant": {
"_parent": {
"type": "chod"
},
"_routing": {
"required": true
},
"properties": {
"percentage": {
"type": "double"
}
}
}
}
},
When I execute:
PUT /chods/variant/565?parent=36442
{ // some data }
It returns:
{
"_index":"chods",
"_type":"variant",
"_id":"565",
"_version":6,
"result":"updated",
"_shards":{
"total":2,
"successful":1,
"failed":0
},
"created":false
}
But when I run this query:
GET /chods/variant/565?parent=36442
It returns variant with parent=36443
{
"_index": "chods",
"_type": "variant",
"_id": "565",
"_version": 7,
"_routing": "36443",
"_parent": "36443",
"found": true,
"_source": {
...
}
}
Why it returns with parent 36443 and not 36442?
When I tried to reproduce this with your steps, I got the expected result (version=36442). I noticed that after your PUT of the document with "_parent": "36442" the output is "_version":6. In your GET of the document, "_version": 7 is returned. Is it possible that you posted another version of the document?
I also noticed that GET /chods/variant/565?parent=36443 would not actually filter by the parent id - the query parameter is disregarded. If you actually want to filter by parent id, this is the query you're looking for:
GET /chods/_search
{
"query": {
"parent_id": {
"type": "variant",
"id": "36442"
}
}
}
As #fylie pointed out the main problem is that if you use same id of the document you will get your document overridden by last version - sort of
Lets say that we have index /tests and type "a" which is child of type "test" and we do following commands:
PUT /tests/a/50?parent=25
{
"item": "C"
}
PUT /tests/a/50?parent=26
{
"item": "D"
}
PUT /tests/a/50?parent=50
{
"item": "E",
"item2": "F",
}
What the result will be? Well it can result in creating 1 - 3 documents.
If it will route to the same shard, you will end up with one document, which will have 3 versions.
If it will route to 3 different shards, you will end up with 3 new documents.

Elastic Search existing field mapping

I am new to elastic search, I have created an index "cmn" with a type "mention". I am trying to import data from my existing solr to elasticsearch, so I want to map an existing field to the _id field.
I have created the following file under /config/mappings/cmn/,
{
"mappings": {
"mentions":{
"_id" : {
"path" : "docKey"
}
}
}
}
But this doesn't seem to be working, every time I index a record the following _id is created,
"_index": "cmn",
"_type": "mentions",
"_id": "k4E0dJr6Re2Z39HAIjYMmg",
"_score": 1
Also, the mapping is not reflects. I have also tried the following option,
{
"mappings": {
"_id" : {
"path" : "docKey"
}
}
}
SAMPLE DOCUMENT: Basically a tweet.
{
"usrCreatedDate": "2012-01-24 21:34:47",
"sex": "U",
"listedCnt": 2,
"follCnt": 432,
"state": "Southampton",
"classified": 0,
"favCnt": 468,
"timeZone": "Casablanca",
"twitterId": 473333038,
"lang": "en",
"stnostem": "#ootd #ootw #fashion #styling #photography #white #pink #playsuit #prada #sunny #spring http://t.co/YbPFrXlpuh",
"sourceId": "tw",
"timestamp": "2014-04-09T22:58:00.396Z",
"sentiment": 0,
"updatedOnGMTDate": "2014-04-09T22:56:57.000Z",
"userLocation": "Southampton",
"age": 0,
"priorityScore": 57.4700012207031,
"statusCnt": 14612,
"name": "YazzyK",
"profilePicUrl": "http://pbs.twimg.com/profile_images/453578494556270594/orsA0pKi_normal.jpeg",
"mentions": "",
"sourceStripped": "Instagram",
"collectionName": "STREAMING",
"tags": "557/161/193/197",
"msgid": 1397084280396.33,
"_version_": 1464949081784713200,
"url2": "{\"urls\":[{\"url\":\"http://t.co/YbPFrXlpuh\",\"expandedURL\":\"http://instagram.com/p/mliZbgxVZm/\",\"displayURL\":\"instagram.com/p/mliZbgxVZm/\",\"start\":88,\"end\":110}]}",
"links": "http://t.co/YbPFrXlpuh",
"retweetedStatus": "",
"twtScreenName": "YazKader",
"postId": "454030232501358592",
"country": "Bermuda",
"message": "#ootd #ootw #fashion #styling #photography #white #pink #playsuit #prada #sunny #spring http://t.co/YbPFrXlpuh",
"source": "Instagram",
"parentStatusId": -1,
"bio": "Live and breathe Fashion. Persian and proud- Instagram: #Yazkader",
"createdOnGMTDate": "2014-04-09T22:56:57.000Z",
"searchText": "#ootd #ootw #fashion #styling #photography #white #pink #playsuit #prada #sunny #spring http://t.co/YbPFrXlpuh",
"isFavorited": "False",
"frenCnt": 214,
"docKey": "tw_454030232501358592"
}
Also, how can we create unique mapping for each "TYPE" and not just the index.
Thanks
Do like this,
Put the mapping as,
PUT index_name/type_name/_mapping
{
"type_name": {
"_id": {
"path": "docKey"
},
"properties": {
"docKey": {
"type": "string"
}
}
}
}
And, it will work. (When you index docKey, then _id is set). You shouldn't have to provide all the mapping.

Resources