How to index dump of html files to elasticsearch? - elasticsearch

I am totaly new in elastic so my knowledge is only from elasticsearch site and I need to help.
My task is to index large row data in html format into elastic search. I already crawled my data and stored it onto disk (200 000 html files). My question is what is the simplest way to index all html files into elasticsearch? Should I do it manualy by for each document to make put request to elastic? For example like:
curl -XPUT 'http://localhost:9200/registers/tomas/1' -d '{
"user" : "tomasko",
"post_date" : "2009-11-15T14:12:12",
"field 1" : "field data"
"field 2" : "field 2 data"
}'
And second question is if I have to parse HTML document to retrieve data for JSON field 1 like in example code over?
And finaly after indexing may I delete all HTML documents? Thanks for all.

I'd look at the bulk api that allows you to send more than document in a single request, in order to speed up your indexing process. You can send batch of 10, 20 or more documents, depending on how big they are.
Depending on what you want to index you might need to parse the html, unless you want to index the whole html as a single field (you might want to use the html strip char filter in that case to strip out the html tags from the indexed text).
After indexing I'd suggest to make sure the mapping is correct and you can find what you're looking for. You can always reindex using the _source special field that elasticsearch stores under the hood, but if you already wrote your indexer code you might want to use it again to reindex when needed (of course with the same html documents). In practice, you never index your data once... so be careful :) even though elasticsearch always helps you out with the _source field), it's just a matter of querying the existing index and reindex all its documents on another index.

#javanna's suggestion to look at the Bulk API will definitely lead you in the right direction. If you are using NEST, you can store all your objects in a list which you can then serialize JSON objects for indexing the content.
Specifically, if you want to strip the html tags out prior to indexing and storing the content as is, you can use the mapper attachment plugin - in which when you define the mapping, you can categorize the content_type to be "html."
The mapper attachment is useful for many things especially if you are handling multiple document types, but most notably - I believe just using this for the purpose of stripping out the html tags is sufficient enough (which you cannot do with the html_strip char filter).
Just a forewarning though - NONE of the html tags will be stored. So if you do need those tags somehow, I would suggest defining another field to store the original content. Another note: You cannot specify multifields for mapper attachment documents, so you would need to store that outside of the mapper attachment document. See my working example below.
You'll need to result in this mapping:
{
"html5-es" : {
"aliases" : { },
"mappings" : {
"document" : {
"properties" : {
"delete" : {
"type" : "boolean"
},
"file" : {
"type" : "attachment",
"fields" : {
"content" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "autocomplete"
},
"author" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets"
},
"title" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "autocomplete"
},
"name" : {
"type" : "string"
},
"date" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
"keywords" : {
"type" : "string"
},
"content_type" : {
"type" : "string"
},
"content_length" : {
"type" : "integer"
},
"language" : {
"type" : "string"
}
}
},
"hash_id" : {
"type" : "string"
},
"path" : {
"type" : "string"
},
"raw_content" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "raw"
},
"title" : {
"type" : "string"
}
}
}
},
"settings" : { //insert your own settings here },
"warmers" : { }
}
}
Such that in NEST, I will assemble the content as such:
Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";
Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);
I have tested this in Sense - results are as follows:
"file": {
"_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
"_content_length": 0,
"_content_type": "html",
"_date": "0001-01-01T00:00:00",
"_title": "Topic10"
},
"delete": false,
"raw_content": "<h1>Topic10</h1><p>Delete this text and replace it with your own content. Check your mailbox.</p><p> </p><p>asdf</p><p> </p><p>10</p><p> </p><p>Lavender.</p><p> </p><p>10/6 12:03</p><p> </p><p>5 09</p><p> </p><p>11 47</p><p> </p><p>Halloween is in October.</p><p> </p><p>jog</p>"
},
"highlight": {
"file.content": [
"\n <em>Topic10</em>\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n  \n\n asdf\n\n  \n\n 10\n\n  \n\n Lavender.\n\n  \n\n 10/6 12:03\n\n  \n\n 5 09\n\n  \n\n 11 47\n\n  \n\n Halloween is in October.\n\n  \n\n jog\n\n "
]
}

Related

What does "mappings" do in Elasticsearch?

I just started learning Elasticsearch. I am trying out to create index, adding data, deleting data, and search data.
I can also understand the settings of Elasticsearch.
When using "PUT" to use settings
{
"settings": {
"index.number_of_shards" : 1,
"index.number_of_replicas" : 0
}
}
When using "GET" to retrieve settings information
{
"dsm" : {
"settings" : {
"index" : {
"creation_date" : "1555487684262",
"number_of_shards" : "1",
"number_of_replicas" : "0",
"uuid" : "qsSr69OdTuugP2DUwrMh4g",
"version" : {
"created" : "7000099"
},
"provided_name" : "dsm"
}
}
}
}
However,
What does "mappings" do in Elasticsearch?
{
"kibana_sample_data_flights" : {
"aliases" : { },
"mappings" : {
"properties" : {
"AvgTicketPrice" : {
"type" : "float"
},
"Cancelled" : {
"type" : "boolean"
},
"Carrier" : {
"type" : "keyword"
},
"Dest" : {
"type" : "keyword"
},
"DestAirportID" : {
"type" : "keyword"
},
"DestCityName" : {
}, // just part of data
The mapping document is a way of describing the structure of your data and defining the types eg boolean, text, keyword. These types are important as they determine how your fields are indexed and analysed.
Elasticsearch supports dynamic mapping, so effectively performs an automatic best guess of the appropriate types but you may wish to override these.
I found this to be a useful article to explain the mapping process:
https://www.elastic.co/blog/found-elasticsearch-mapping-introduction
Indexing is determined by the field type for example where the type is 'keyword' the search engine will be expecting an exact match, when the type is 'text' the search engine will be trying to determine how well the document matches the query term and in so doing so will be performing a 'full text search'.
So for example:
- A search for jump should also match jumped, jumps, jumping, and perhaps even leap.
This is a great article describing exact vs full text search and is where I took the jump example: https://www.elastic.co/guide/en/elasticsearch/guide/current/_exact_values_versus_full_text.html
Much of the power of elasticsearch is in the mapping and analysis.
Its the mapping of the index. This means it describes the data that is stored in this index. Take a deeper look here.

time-based when configure an index pattern not working

Hi!
I have an issue about set a date field as time-based when I configure my index pattern. When I choose my date filed on the timefield name, I cannot Vizualise any data on the Discover part.
However, when I uncheck the box named Index contains time-based events, all data appears:
Maybe I forgot something during my mapping ? There is the mapping I've set for this index:
"index_test" : {
"mappings": {
"tr": {
"_source": {
"enabled":true
},
"properties" : {
"id" : { "type" : "integer" },
"volume" : { "type" : "integer" },
"high" : { "type" : "float" },
"low" : { "type" : "float" },
"timestamp" : { "type" : "date", "format" : "yyyy-MM-dd HH:mm:ss" }
}
}
}'
}
I am currently try to use timelion also, and it seems to not found any data to show. I think it cannot because of this time-based unchecked... Any idea about how set this timestamp as time-based without loose the data access on the Discover part ?
Simple question with simple answer... I just forgot to set the timepicker in the Right-top of the Discover part to show past data:

How to add default values while adding a new field in existing mapping in elasticsearch

This is my existing mapping in elastic search for one of the child document
sessions" : {
"_routing" : {
"required" : true
},
"properties" : {
"operatingSystem" : {
"index" : "not_analyzed",
"type" : "string"
},
"eventDate" : {
"format" : "dateOptionalTime",
"type" : "date"
},
"durations" : {
"type" : "integer"
},
"manufacturer" : {
"index" : "not_analyzed",
"type" : "string"
},
"deviceModel" : {
"index" : "not_analyzed",
"type" : "string"
},
"applicationId" : {
"type" : "integer"
},
"deviceId" : {
"type" : "string"
}
},
"_parent" : {
"type" : "userinfo"
}
}
in above mapping "durations" field is an integer array. I need to update the existing mapping by adding a new field called "durationCount" whose default value should be the size of durations array.
PUT sessions/_mapping
{
"properties" : {
"sessionCount" : {
"type" : "integer"
}
}
}
using above mapping I am able to update the existing mapping but I am not able to figure out how to assign a value ( which would vary for each session document like it should be durations array size ) while updating the mapping. any ideas ?
Well 2 recommendations here -
Instead of adding default value , you can adjust it in the query using missing filter. Lets say , you want to search based on a match query - Instead of just match query , use a bool query with should clause having the match and missing filter. inside filtered query. This way , those documents which did not have the field is also accounted.
If you absolutely need the value in that field for existing documents , you need to reindex the whole set of documents. Or , use the out of box plugin , update by query -

In elasticsearch, how important is it to fully define a mapping during mapping-creation?

I am creating a mapping like this
"institution" : {
"properties" : {
"InstitutionCode" : {
"type" : "string",
"store" : "yes"
},
"InstitutionID" : {
"type" : "integer",
"store" : "yes"
},
"Name" : {
"type" : "string",
"store" : "yes"
}
}
}
However, when I perform actual indexing operations for institutions, I am adding an Alias property (0 or more aliases per institution)
"institution" : {
"properties" : {
"Aliases" : {
"dynamic" : "true",
"properties" : {
"InstitutionAlias" : {
"type" : "string"
},
"InstitutionAliasTypeID" : {
"type" : "long"
}
}
},
"InstitutionCode" : {
"type" : "string",
"store" : "yes"
},
"InstitutionID" : {
"type" : "integer",
"store" : "yes"
},
"Name" : {
"type" : "string",
"store" : "yes"
}
}
}
This is actually a simplified example, as I am actually adding more fields than just Aliases during the actual indexing of records.
How important is it to to fully define a mapping during mapping-creation?
Am I going to suffer any penalties by having the mapping automatically adjusted during indexing operations due to the indexing of institution records with additional properties? I expect institutions to gain additional properties over time and I wonder if I need to maintain the mapping-creation code in addition to the institution-indexing code.
I believe the overhead of dynamic mapping is fairly negligible...using them won't hurt indexing speed. However, you can run into some unexpected situations where ElasticSearch auto-detects a field type incorrectly.
A common example is detecting an integer because the first example of a field is a number ("25"), when in reality the rest of the data for that field is a string. Or seeing an integer when the rest of the data is actually a float. Etc etc.
If your data is well standardized that isn't much of a problem.
Alternatively, you can use dynamic templates to apply mappings to new fields based on a regex pattern.

Storing only selected fields and not storing _all in pyes/elasticsearch

I am trying to use pyes with elasticsearch as full text search engine, I store only UUIDs and indexes of string fields, actual data is stored in MonogDB and retrieved using UUIDs. Unfortunately, I am unable to create a mapping that wouldn't store original data, I've tried various combinations of "store"/"source" fields and disabling "_all" but I can still get text of indexed fields. It seems that documentation is misleading on this topic as it's just a copy of original docs.
Can anyone please provide an example of mapping that would only store some fields and not the original document JSON?
Sure, you could use something like this (with two fields, 'uuid' and 'body'):
{
"mytype" : {
"_source" : {
"enabled" : false
},
"_all" : {
"enabled" : false
},
"properties" : {
"data" : {
"store" : "no",
"type" : "string"
},
"uuid" : {
"store" : "yes",
"type" : "string",
"index" : "not_analyzed"
}
}
}
}

Resources