Tokenizing a multi-language text field in Elasticsearch


I have the following table, which contains data for millions of documents in the form of a JSON file:
+-------+---------------------------------------+------------+
| doc_id| doc_text                              | doc_lang   |
+-------+---------------------------------------+------------+
| doc1  | "first /resource X 'title' "          | en         |
| doc2  | "<r>ressource 2 #titre en France"     | Fr         |
| doc3  | "die Tür geöffnet?"                   | ge         |
| doc4  | "$risorsa 4 <in> lingua italiana"     | It         |
| ...   | " ........."                          | ..         |
| ...   | "........."                           | ..         |
+-------+---------------------------------------+------------+
I need to do the following:
Tokenize, filter, and remove stopwords from each document's text, choosing the analyzer dynamically according to the language given in the doc_lang field (assume European languages).
Get the TF and IDF of each term in the doc_text field (no search operations are required, only scoring).
Q) Could anybody advise me whether Elasticsearch is a good choice in this case?
P.S. I am looking for something compatible with Apache Spark.

Include the language code in the name of the doc_text field when indexing, for example:
{ "doc_id": "doc", "doc_text_en": "xxx", "doc_lang": "en"}
Then you will be able to define a dynamic mapping that assigns a language-specific analyzer based on the field name.
https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-dynamic-mapping.html
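As a rough illustration of the idea above, here is a minimal Python sketch that renames doc_text per language and builds matching dynamic templates as plain dictionaries. The helper names (`ANALYZERS`, `to_indexable`, `dynamic_templates`) are hypothetical, the language codes are the ones used in the question (including "ge" for German), and the analyzer names are assumed to be Elasticsearch's built-in language analyzers. Nothing here talks to a cluster; the resulting list would go under `mappings.dynamic_templates` when creating the index.

```python
# Assumption: doc_lang codes as written in the question, mapped to
# Elasticsearch's built-in language analyzers.
ANALYZERS = {"en": "english", "fr": "french", "ge": "german", "it": "italian"}


def to_indexable(doc):
    """Move doc_text into a language-suffixed field such as doc_text_en."""
    lang = doc["doc_lang"].lower()  # the question mixes "Fr", "It", "en"
    return {
        "doc_id": doc["doc_id"],
        f"doc_text_{lang}": doc["doc_text"],
        "doc_lang": lang,
    }


def dynamic_templates():
    """One dynamic template per language, matched on the field name."""
    return [
        {
            f"text_{lang}": {
                "match": f"doc_text_{lang}",
                "match_mapping_type": "string",
                "mapping": {"type": "text", "analyzer": analyzer},
            }
        }
        for lang, analyzer in ANALYZERS.items()
    ]


if __name__ == "__main__":
    print(to_indexable({"doc_id": "doc2", "doc_text": "ressource 2", "doc_lang": "Fr"}))
    print({"mappings": {"dynamic_templates": dynamic_templates()}})
```

Matching on the field name keeps the language decision in the document itself, so no per-request analyzer selection is needed at index time.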

Related

How can I merge multiple columns from two different files in Talend

Let's say I have multiple columns coming from two different files, like this:
USERNAME | AGE | GENDER | CHILDREN
Joe | 23 | male | 2
Annie | 45 | female | 5
| | |
And another one like this:
USERNAME | AGE |
Jonathan | 33 |
Mike | 41 |
I want to merge the columns that share a name into one, while keeping the columns that exist in only one of the files, like this:
USERNAME | AGE | GENDER | CHILDREN
Joe | 23 | male | 2
Annie | 45 | female | 5
Jonathan | 33 | |
Mike | 41 | |
Sorry if the answer is obvious, I'm new to Talend. Thanks.

What tools are available to you?
The Append function in SAS, for example, can do this for you.
You can use the same append approach in Python, R, or whichever other language you intend to use.
For Talend:
Copy the complete subjob1 – copy me sub job and paste it to create a second sub job.
Link the two sub jobs using an onSubjobOK link.
Open tFixedFlowInput, and change Records from first subjob to Records from second subjob.
Open tFileOutputDelimited on the new sub job, and tick Append.
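For the non-Talend route mentioned in this answer, the append can be sketched in Python with only the standard library. The column names and rows come from the question; the function name `append_rows` and the in-memory CSV inputs are assumptions made for illustration.

```python
import csv
import io


def append_rows(*sources):
    """Union the rows of several CSV sources; missing columns become empty strings."""
    fieldnames, rows = [], []
    for source in sources:
        reader = csv.DictReader(source)
        # Collect column names in first-seen order across all sources.
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        rows.extend(reader)
    padded = [{name: row.get(name, "") for name in fieldnames} for row in rows]
    return fieldnames, padded


# Stand-ins for the two files from the question.
first = io.StringIO("USERNAME,AGE,GENDER,CHILDREN\nJoe,23,male,2\nAnnie,45,female,5\n")
second = io.StringIO("USERNAME,AGE\nJonathan,33\nMike,41\n")

columns, merged = append_rows(first, second)

if __name__ == "__main__":
    print(columns)
    for row in merged:
        print(row)
```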

Use a tUnite component to accomplish that.
Here is the link to the documentation: https://help.talend.com/r/fr-FR/8.0/orchestration/tunite
Your flow would be:
tFileInput1 (Excel or CSV) ----------------------------------------------------+
                                                                               +--> tUnite -> tLogRow
tFileInput2 (Excel or CSV) -> tMap (add empty GENDER and CHILDREN fields) -----+

Missing results after reducing the visualization size

I would like to count identical log messages in Kibana. With Size set to 200, two messages turn out to have occurred twice.
But if I lower Size to 5, I don't see those two.
It should show me top 5 rows, ordered by count. I expected something like this:
| LogMessage | Count |
|------------|-------|
| xx | 2 |
| yy | 2 |
| zz | 1 |
| qq | 1 |
| ww | 1 |
What am I missing?

The issue is the little warning about Analyzed Field. You should use a keyword field.
With analyzed fields, the analyzer breaks the original string down into sub-strings during indexing to facilitate search use cases (handling things like word boundaries, punctuation, case insensitivity, declension, etc.).
A keyword field is just a simple string.
What's probably happening is that you have data like
| LogMessage | Count |
|------------|-------|
| a | 1 |
| b | 1 |
| c x | 1 |
| d x | 1 |
With an analyzed field and a terms aggregation of size 2, you might (depending on the sort order) get a and b.
With a larger terms aggregation, the top sub-string will be x.
This is a simplified example, but I hope it gets the issue across.
The Terms Aggregation docs have a good section about how to avoid/solve this issue.
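The difference between the two field types can be mimicked in a few lines of Python. This is only a sketch: lowercasing and splitting on whitespace stands in for a real analyzer, and the function names are made up for the example. It uses the four sample messages from the table above.

```python
from collections import Counter


def keyword_buckets(messages):
    """Bucket on the whole, unmodified string, as a keyword field would."""
    return Counter(messages)


def analyzed_buckets(messages):
    """Bucket on tokens; a token counts once per message, like a doc count."""
    counts = Counter()
    for message in messages:
        # Assumption: lowercase + whitespace split approximates the analyzer.
        counts.update(set(message.lower().split()))
    return counts


messages = ["a", "b", "c x", "d x"]

if __name__ == "__main__":
    print(keyword_buckets(messages).most_common(5))
    print(analyzed_buckets(messages).most_common(5))
```

With the keyword-style buckets every message keeps a count of 1, while the token-style buckets no longer contain "c x" or "d x" at all and are led by the shared sub-string x.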

<blockquote> tag inserted when using image in cell of RST table?

When I use the following code:
+----------------------+---------------+---------------------------------------------------------------------+
| A                    | B             | C                                                                   |
+======================+===============+=====================================================================+
| Merchant Rating      | Ad Extension  | Star ratings plus number of reviews for the advertiser/merchant.    |
|                      |               |                                                                     |
|                      |               |.. image:: /images/merchant-rating.png                               |
+----------------------+---------------+---------------------------------------------------------------------+
The text preceding the image in column C gets wrapped in <blockquote> tags in the HTML output. Is there any way to avoid this?

To avoid the blockquote tag in the first paragraph of the third column, you could try using this:
+----------------------+---------------+---------------------------------------------------------------------+
| A                    | B             | C                                                                   |
+======================+===============+=====================================================================+
| Merchant Rating      | Ad Extension  | Star ratings plus number of reviews for the advertiser/merchant.    |
|                      |               |                                                                     |
|                      |               | |img|                                                               |
+----------------------+---------------+---------------------------------------------------------------------+
.. |img| image:: /images/merchant-rating.png
Instead, you'll get two paragraphs.

Use a substitution and remove the separating line so that Sphinx interprets the content as a single block of text.
+-----------------+--------------+------------------------------------------------------------------+
| A               | B            | C                                                                |
+=================+==============+==================================================================+
| Merchant Rating | Ad Extension | Star ratings plus number of reviews for the advertiser/merchant. |
|                 |              | |img|                                                            |
+-----------------+--------------+------------------------------------------------------------------+
.. |img| image:: /images/merchant-rating.png

Efficient way to join by levenshtein in Hive or Impala

I have two tables: one (NLIST) has about 17K records, the other (FNAMES) about 57K.
I would like to join the two by comparing records using the Levenshtein distance.
Here is an example of the tables' contents:
Table NLIST:
+------+-------------+
| ID | S_NAME |
+------+-------------+
| 1 | Avi |
| 2 | Moshe |
| 3 | David |
....
Table FNAMES:
+------+-------------+
| ID | NICKNAMES |
+------+-------------+
| 1 | Avile |
| 2 | Dudi |
| 3 | Moshiko |
| 4 | Avi |
| 5 | DAVE |
....
The above tables are just examples. In the real case the names column can include more than one word.
The required result should be:
+------+-------------+--------+
| ID | NICKNAMES | S_NAME |
+------+-------------+--------+
| 1 | Avile | Avi |
| 2 | Dudi | David |
| 3 | Moshiko | Moshe |
| 4 | Avi | Avi |
| 5 | DAVE | David |
...
Here is the code I use:
select FNAMES.NICKNAMES, NLIST.S_NAME
from FNAMES
LEFT OUTER JOIN NLIST
ON(true)
WHERE levenshtein (FNAMES.NICKNAMES, NLIST.S_NAME) <=4
The above code ran for a very long time, so I stopped it.
How can I make it run in a reasonable time?
In addition, I think the Levenshtein distance depends on the length of the words. How can I find the optimal value for the distance (in this case I chose 4 arbitrarily)?

Hive table performance depends on several factors:
Query engine
File format
Vectorization: set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true;
If you have a good server, you can try Impala, which is generally faster than Hive.
You can also fine-tune Impala to run this query faster; see "Tuning Impala for Performance" in the Impala documentation.
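Independently of engine tuning, the cross join in the question can skip many pairs cheaply: the Levenshtein distance between two strings is never smaller than the difference of their lengths, so any pair whose lengths differ by more than the threshold cannot match. The sketch below shows that pre-filter in plain Python, outside Hive or Impala. The function names, the case-insensitive comparison, and the threshold of 2 are assumptions for illustration, not part of the original query.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_join(nicknames, names, max_distance):
    """Return (nickname, name, distance) for pairs within max_distance."""
    matches = []
    for nick in nicknames:
        for name in names:
            # Cheap pre-filter: the distance is at least the length difference.
            if abs(len(nick) - len(name)) > max_distance:
                continue
            distance = levenshtein(nick.lower(), name.lower())
            if distance <= max_distance:
                matches.append((nick, name, distance))
    return matches


if __name__ == "__main__":
    print(fuzzy_join(["Avile", "Avi"], ["Avi", "Moshe"], max_distance=2))
```

On the question about the threshold depending on word length: one option is to compare `distance / max(len(nick), len(name))` against a ratio instead of using a fixed value such as 4, so that short names are not matched too loosely.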

JasperReports: exporting the report to csv format

I am working with JasperReports 4.5.0 and Spring 3.0.5 RELEASE. I export my JR report in PDF, HTML, and CSV formats. With PDF and HTML, the report is generated fine.
But when I export the report to CSV, all the fields and values are displayed in a single column. My code flow is exactly like this link.
Below is an example of what I am getting now.
| A | B | C | D |
|S.No,IPAddress,TotalDuration,TotalBdrCount | | | |
|1,null,266082,null | | | |
|2,null,null,null | | | |
|3,null,null,null | | | |
|4,null,null,null | | | |
Here S.No,IPAddress,TotalDuration,TotalBdrCount are the column headers and 1,null,266082,null are the values of the respective columns.
But what I need is:
| A | B | C | D |
| S.No | IPAddress | TotalDuration | TotalBdrCount |
I hope the problem is clear. Do I need to set any parameters for this? Can anyone help me with this issue?
