My Elasticsearch index has more than 1000 fields due to my SQL schema, and I get the exception below:
{'type': 'illegal_argument_exception', 'reason': 'Limit of total
fields [1000] in index }
And my bulk insert looks like this:
import json
from elasticsearch import Elasticsearch, helpers
from elasticsearch.helpers import BulkIndexError

BATCHSIZE = 1000           # documents per bulk request
es = Elasticsearch()
batch = []
counter = 0
insertedrecords = 0
dict = {}                  # note: this name shadows the built-in dict
tempdict = {}

with open('audit1.txt') as file:
    for line in file:
        columns = line.split(r'||')
        dict['TimeStamp'] = columns[0].strip('\'')
        dict['BusinessTimeStamp'] = columns[1].strip('\'')
        dict['RuntimeMicroflowID'] = columns[2].strip('\'')
        dict['MicroflowID'] = columns[3].strip('\'')
        dict['UserId'] = columns[4].strip('\'')
        dict['ClientId'] = columns[5].strip('\'')
        dict['Userlocation'] = columns[6].strip('\'')
        dict['Transactionid'] = columns[7].strip('\'')
        dict['Catagorie'] = columns[8].strip('\'')
        dict['EventType'] = columns[9].strip('\'')
        dict['Operation'] = columns[10].strip('\'')
        dict['PrimaryData'] = columns[11].strip('\'')
        dict['SecondayData'] = columns[12].strip('\'')
        i = 13
        while i + 2 < len(columns):          # remaining columns come as name/old/new triples
            tempdict['BFOLDVALUE'] = columns[i + 1].strip('\'')
            tempdict['BFNEWVALUE'] = columns[i + 2].strip('\'')
            if columns[i].strip('\''):
                dict[columns[i].strip('\'')] = tempdict.copy()
            i += 3
            tempdict.clear()
        # print(json.dumps(dict, indent=4))
        batch.append(dict.copy())            # copy, otherwise dict.clear() below empties the batched doc
        if counter == BATCHSIZE:
            try:
                helpers.bulk(es, batch, index='audit-index', doc_type='audit')
                insertedrecords += counter
                counter = 0
                batch.clear()
                print(insertedrecords, " - records have been inserted")
            except BulkIndexError as e:
                print("Error occurred -- continuing")
                print(json.dumps(dict, indent=4))
                print(e)
                batch.clear()
                break
        counter += 1
        dict.clear()
So I am assuming I am indexing this the wrong way... is there a better way to index this kind of format in Elasticsearch? Note that I am using ELK version 7.5.
Here is a sample of the file I am parsing into Elasticsearch:
2018.07.17/15:41:53.735||2018.07.17/15:41:53.735||'0164a8424fbbp84h%2139165'||'BT_TTB_CashDep_PRC'||'eskedarz'||'UXP'||'00001039'||'0164a842e519pJpA'||'Persistence'||''||'CREATE'||'DailyTxns'||'0164a842e4eapJnu'||'CurrentThread'||'WebContainer : 15'||''||'ParentThread'||'system'||''||'TCPWorkerThreadID'||'WebContainer : 15'||''||'f_POSTINGDT'||'2018-07-17'||''||'versionNum'||'0'||''||'f_TXNAMTDR'||'0'||''||'f_ACCOUNTID'||'013XXXXXXXXX0'||''||'f_VALUEDTTM'||'2018-07-17 15:41:53.0'||''||'f_POSTINGDTTM'||'2018-07-17 15:41:53.692'||''||'f_TXNCLBAL'||'25551.610000'||''||'f_TXNREF'||'0000103917071815410685326'||''||'f_PIEVENTTYPE'||'N'||''||'f_TXNAMT'||'5000.00'||''||'f_TRANSACTIONID'||'0164a842e4e9pJng'||''||'f_TYPE'||'N'||''||'f_USERID'||'xxxarz'||''||'f_SRNO'||'1'||''||'f_TXNBASEEQ'||'5000.00'||''||'f_TXNSRCBRANCH'||'0000X039'||''||'f_TXNCODE'||'T08'||''||'f_CHANNELID'||'BranchTeller'||''||'f_TXNAMTCR'||'5000.00'||''||'f_TXNNARRATION'||'SELF '||''||'f_ISACCRUALPENDING'||'false'||''||'f_TXNDTTM'||'2018-07-17 15:41:53.689'||''
If you look carefully at this part of the error message, it becomes clear:
Limit of total fields [1000] in index
1000 is the default limit on the total number of fields in an Elasticsearch index, as shown in the Elasticsearch source code:
public static final Setting<Long> INDEX_MAPPING_TOTAL_FIELDS_LIMIT_SETTING =
Setting.longSetting("index.mapping.total_fields.limit", 1000L, 0, Property.Dynamic, Property.IndexScope);
Note that this is a dynamic setting, so it can be changed on a given index by updating the index settings:
PUT test_index/_settings
{
  "index.mapping.total_fields.limit": 1500
}
(change 1500 to whatever is suitable for your index)
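Since the question already uses the Python client, the same setting can also be applied from code. A minimal sketch, assuming the index name 'audit-index' from the question and an illustrative limit of 1500:

from elasticsearch import Elasticsearch

es = Elasticsearch()
# Raise the total-fields limit on the existing index (pick a value that suits your data).
es.indices.put_settings(
    index='audit-index',
    body={'index.mapping.total_fields.limit': 1500}
)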
More info on this issue can be found here and here.
A better way to handle such an exploding index is to normalize it, as you would in an RDBMS: store the variable key/value combinations in a nested structure instead of as top-level fields.
Example:
{"keyA": "ValueA", "keyB": "ValueB", "keyC": "ValueC", ...}   <-- original record
becomes
{"keyA": "ValueA", "KeyValue": [{"Key": "keyB", "Value": "ValueB"}, {"Key": "keyC", "Value": "ValueC"}]}   <-- normalized record
so a search then matches on the pair, e.g. KeyValue.Key == "keyB" AND KeyValue.Value == "ValueB" (typically via a nested query); a sketch of such a mapping follows below.
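A minimal sketch of that approach with the Python client, assuming Elasticsearch 7.x, a freshly created index, and illustrative field names (Attributes, Key, OldValue, NewValue) chosen to match the name/old/new triples in the audit file:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# One nested field replaces the unbounded set of dynamic top-level fields,
# so the mapping size stays fixed no matter how many attribute names appear.
es.indices.create(index='audit-index', body={
    "mappings": {
        "properties": {
            "TimeStamp": {"type": "keyword"},
            "Attributes": {
                "type": "nested",
                "properties": {
                    "Key":      {"type": "keyword"},
                    "OldValue": {"type": "keyword"},
                    "NewValue": {"type": "keyword"}
                }
            }
        }
    }
})

# Querying a single key/value pair then uses a nested query, for example:
query = {
    "query": {
        "nested": {
            "path": "Attributes",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"Attributes.Key": "f_TXNCODE"}},
                        {"term": {"Attributes.NewValue": "T08"}}
                    ]
                }
            }
        }
    }
}
res = es.search(index='audit-index', body=query)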
Related
Problem description: I am trying to update documents in Elasticsearch. I have data in a DataFrame with esIndex, id, and body columns. I need to use the value in the esIndex column to update the document in the right index. The code below does not work. What is a possible fix?
Command 1
val esUrl = "test.elasticsearch.com"
Command 2
import spark.implicits._
val df = Seq(
("test_datbricks001","1221866122238136328", "TEST DATA 001"),
("test_datbricks001","1221866122238136329", "TEST DATA 002"),
("test_datbrick","1221866122238136329", "TEST DATA 002")
).toDF("esIndex", "id", "body")
Command 3
display(df)
Command 4
import org.apache.spark.sql.functions.col
val retval = df.toDF.write
.format("org.elasticsearch.spark.sql")
.option("es.nodes.wan.only","true")
.option("es.mapping.id","id")
.option("es.mapping.exclude", "id, esIndex")
.option("es.port","80")
.option("es.nodes", esUrl)
.option("es.write.operation", "update")
.option("es.index.auto.create", "no")
.mode("Append")
.save("{esIndex}") // <<== This index is not taking value from "index" column of df
I am getting the following error: EsHadoopIllegalArgumentException: Target index [{esIndex}] does not exist and auto-creation is disabled [setting 'es.index.auto.create' is 'false']
Question: How do I get the code in Command 4 to use the value in the esIndex column of df?
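One workaround sketch, not necessarily the canonical fix: the error suggests that with es.index.auto.create set to no, es-hadoop checks the target resource up front, and the literal {esIndex} pattern fails that check. Writing each index's rows separately keeps the target a concrete index name. This reuses df and esUrl from the commands above; everything else is unchanged.

import org.apache.spark.sql.functions.col

// Collect the distinct target indices, then write each subset to its own index.
val indexNames = df.select("esIndex").distinct.collect.map(_.getString(0))
indexNames.foreach { idx =>
  df.filter(col("esIndex") === idx)
    .write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes.wan.only", "true")
    .option("es.mapping.id", "id")
    .option("es.mapping.exclude", "id, esIndex")
    .option("es.port", "80")
    .option("es.nodes", esUrl)
    .option("es.write.operation", "update")
    .option("es.index.auto.create", "no")
    .mode("append")
    .save(idx)   // concrete index name instead of the "{esIndex}" pattern
}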
I'm using Elasticsearch 6.8 and Python 3.
I'm running on my laptop with a single node, and there are no threads or processes inserting, updating, or deleting docs in the index while I'm running the multi-search.
I'm running the following multi-search request:
import json
from elasticsearch import Elasticsearch

es = Elasticsearch()
search_arr = []
# search-1
search_arr.append({'index': 'test1', 'type': 'type1'})
search_arr.append({"query": {"term": {"confidence": "1"}}})
# search-2
search_arr.append({'index': 'test1', 'type': 'type1'})
search_arr.append({"query": {"match_all": {}}, 'from': 0, 'size': 2})

request = ''
for each in search_arr:
    request += '%s \n' % json.dumps(each)
res = es.msearch(body=request)

print("First Query, num of results = ", res['responses'][0]['hits']['total'])
print("Second Query, num of results = ", res['responses'][1]['hits']['total'])
Each time I run this code I get different results (and as I wrote above, there are no processes inserting, deleting, or updating documents).
Why am I getting different results each time?
And what do I need to do to fix it and get consistent results?
I found the cause of the problem.
I needed to pass refresh=True after adding the new data for the first time
(I bulk-indexed thousands of new documents and ran the multi-search right afterwards).
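A minimal sketch of that fix, assuming the test1 index from the question and placeholder documents; the key point is forcing a refresh so the bulk-indexed docs are visible to the msearch that follows immediately:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
docs = [{"confidence": "1"}, {"confidence": "2"}]   # placeholder documents
actions = [{'_index': 'test1', '_type': 'type1', '_source': doc} for doc in docs]

# refresh=True makes the new documents searchable as soon as the bulk call returns.
helpers.bulk(es, actions, refresh=True)
# Alternatively, refresh explicitly after the bulk call:
# es.indices.refresh(index='test1')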
I wrote some code in Spark as follows:
val df = sqlContext.read.json("s3n://blah/blah.gz").repartition(200)
val newdf = df.select("KUID", "XFF", "TS","UA").groupBy("KUID", "XFF","UA").agg(max(df("TS")) as "TS" ).filter(!(df("UA")===""))
val dfUdf = udf((z: String) => {
val parser: UserAgentStringParser = UADetectorServiceFactory.getResourceModuleParser();
val readableua = parser.parse(z)
Array(readableua.getName,readableua.getOperatingSystem.getName,readableua.getDeviceCategory.getName)
})
val df1 = newdf.withColumn("useragent", dfUdf(col("UA")))  // <-- PROBLEM LINE 1
val df2= df1.map {
case org.apache.spark.sql.Row(col1:String,col2:String,col3:String,col4:String, col5: scala.collection.mutable.WrappedArray[String]) => (col1,col2,col3,col4, col5(0), col5(1), col5(2))
}.toDF("KUID", "XFF","UA","TS","browser", "os", "device")
val dataset =df2.dropDuplicates(Seq("KUID")).drop("UA")
val mobile = dataset.filter(dataset("device") === "Smartphone" || dataset("device") === "Tablet")
mobile.write.format("com.databricks.spark.csv").save("s3n://blah/blah.csv")
Here is a sample of the input data
{"TS":"1461762084","XFF":"85.255.235.31","IP":"10.75.137.217","KUID":"JilBNVgx","UA":"Flixster/1066 CFNetwork/758.3.15 Darwin/15.4.0" }
So in the above code snippet, I am reading a 2.4 GB gzip file. The read takes 9 minutes. Then I group by ID and take the max timestamp. However, the line that adds a column with withColumn (PROBLEM LINE 1) takes 2 hours. That line takes a user agent string and tries to derive the OS, device, and browser info. Am I doing things the wrong way here?
I am running this on a 4-node AWS cluster of r3.4xlarge instances (8 cores and 122 GB memory) with the following configuration:
--executor-memory 30G --num-executors 9 --executor-cores 5
The problem here is that gzip is not splittable and cannot be read in parallel. What happens in the background is that a single process downloads the file from the bucket, and only then is the data repartitioned across the cluster. Please re-encode the input data into a splittable format. If the input file does not change often, you could for example consider bzip2 (encoding is quite expensive and might take some time). A re-encode sketch follows below.
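For example, a one-off re-encode sketch that reuses sqlContext and the input path from the question; the output path and the choice of Parquet (rather than bzip2) are just illustrative assumptions for one splittable format:

// Read the gzip file once, then persist it in a splittable format so that
// later jobs can read the data in parallel across the cluster.
val raw = sqlContext.read.json("s3n://blah/blah.gz")
raw.write.parquet("s3n://blah/blah-reencoded.parquet")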
Update: picking up Roberto's answer and adding it here for the benefit of all:
You are creating a new parser for every row inside the UDF: val parser: UserAgentStringParser = UADetectorServiceFactory.getResourceModuleParser();. It is probably expensive to construct, so you should construct one outside the UDF and use it via a closure.
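A sketch of that suggestion: keep the parser in a @transient lazy val inside a serializable holder so it is built once per executor JVM rather than once per row. The uadetector import paths below are assumptions based on the library used in the question:

import net.sf.uadetector.UserAgentStringParser
import net.sf.uadetector.service.UADetectorServiceFactory
import org.apache.spark.sql.functions.{col, udf}

// Built lazily on each executor; @transient keeps the parser out of task serialization.
object UAParserHolder extends Serializable {
  @transient lazy val parser: UserAgentStringParser =
    UADetectorServiceFactory.getResourceModuleParser()
}

val uaUdf = udf((ua: String) => {
  val readable = UAParserHolder.parser.parse(ua)
  Array(readable.getName,
        readable.getOperatingSystem.getName,
        readable.getDeviceCategory.getName)
})

val df1 = newdf.withColumn("useragent", uaUdf(col("UA")))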
I want to load a CSV (comma-separated) file into my HBase table. I already tried it with the help of some articles I found, but so far I am only able to load an entire row (line) into HBase as a single value, i.e. all the values in a row are stored in one column. I want to split each row on the comma delimiter and store the values in different columns of the HBase table's column family.
Please help me solve this issue. Any suggestions are appreciated.
Below are the input file, agent configuration file, and HBase output I am currently using.
1) Input file
8600000US00601,00601,006015-DigitZCTA,0063-DigitZCTA,11102
8600000US00602,00602,006025-DigitZCTA,0063-DigitZCTA,12869
8600000US00603,00603,006035-DigitZCTA,0063-DigitZCTA,12423
8600000US00604,00604,006045-DigitZCTA,0063-DigitZCTA,33548
8600000US00606,00606,006065-DigitZCTA,0063-DigitZCTA,10603
2) Agent configuration file
agent.sources = spool
agent.channels = fileChannel2
agent.sinks = sink2
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /home/cloudera/Desktop/flume
agent.sources.spool.fileSuffix = .completed
agent.sources.spool.channels = fileChannel2
#agent.sources.spool.deletePolicy = immediate
agent.sinks.sink2.type = org.apache.flume.sink.hbase.HBaseSink
agent.sinks.sink2.channel = fileChannel2
agent.sinks.sink2.table = sample
agent.sinks.sink2.columnFamily = s1
agent.sinks.sink2.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
agent.sinks.sink1.serializer.regex = "\"([^\"]+)\""
agent.sinks.sink2.serializer.regexIgnoreCase = true
agent.sinks.sink1.serializer.colNames =col1,col2,col3,col4,col5
agent.sinks.sink2.batchSize = 100
agent.channels.fileChannel2.type=memory
3) HBase output
hbase(main):009:0> scan 'sample'
ROW COLUMN+CELL
1431064328720-0LalKGmSf3-1 column=s1:payload, timestamp=1431064335428, value=8600000US00602,00602,006025-DigitZCTA,0063-DigitZCTA,12869
1431064328720-0LalKGmSf3-2 column=s1:payload, timestamp=1431064335428, value=8600000US00603,00603,006035-DigitZCTA,0063-DigitZCTA,12423
1431064328720-0LalKGmSf3-3 column=s1:payload, timestamp=1431064335428, value=8600000US00604,00604,006045-DigitZCTA,0063-DigitZCTA,33548
1431064328721-0LalKGmSf3-4 column=s1:payload, timestamp=1431064335428, value=8600000US00606,00606,006065-DigitZCTA,0063-DigitZCTA,10603
4 row(s) in 0.0570 seconds
hbase(main):010:0>
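A possible direction, sketched rather than verified: the posted configuration mixes sink1 and sink2 property names (the regex and colNames are set on sink1 while the sink is sink2), and the quoted-string regex does not match the comma-separated sample data. Keeping the serializer properties on sink2 and capturing one group per column would look roughly like this:

agent.sinks.sink2.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
agent.sinks.sink2.serializer.regex = ^([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)$
agent.sinks.sink2.serializer.colNames = col1,col2,col3,col4,col5

With RegexHbaseEventSerializer, each capture group is written to the corresponding column name in the configured column family.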
I've been using Sphinx successfully for a while, but just ran into an issue that has me confused. I back Sphinx with MySQL queries and recently migrated my primary key strategy in a way that made the ids of the tables I'm indexing grow larger than 32 bits (in MySQL they're bigint unsigned). Sphinx was still getting index hits, but returning nonsense ids (presumably the ids from the queries truncated to 32 bits, or something similar).
I looked into it and realized I hadn't passed the --enable-id64 flag to ./configure. No problem; I completely rebuilt Sphinx with that flag (I'm running 0.9.9, by the way). No change though! I'm still experiencing the exact same issue. My test scenario is pretty simple:
MySQL:
create table test_sphinx(id bigint unsigned primary key, text varchar(200));
insert into test_sphinx values (10102374447, 'Girls Love Justin Beiber');
insert into test_sphinx values (500, 'But Small Ids are working?');
Sphinx conf:
source new_proof
{
type = mysql
sql_host = 127.0.0.1
sql_user = root
sql_pass = password
sql_db = testdb
sql_port =
sql_query_pre =
sql_query_post =
sql_query = SELECT id, text FROM test_sphinx
sql_query_info = SELECT * FROM `test_sphinx` WHERE `id` = $id
sql_attr_bigint = id
}
index new_proof
{
source = new_proof
path = /usr/local/sphinx/var/data/new_proof
docinfo = extern
morphology = none
stopwords =
min_word_len = 1
charset_type = utf-8
enable_star = 1
min_prefix_len = 0
min_infix_len = 2
}
Searching:
→ search -i new_proof beiber
Sphinx 0.9.9-release (r2117)
...
index 'new_proof': query 'beiber ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=1512439855, weight=1
(document not found in db)
words:
1. 'beiber': 1 documents, 1 hits
→ search -i new_proof small
Sphinx 0.9.9-release (r2117)
...
index 'new_proof': query 'small ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=500, weight=1
id=500
text=But Small Ids are working?
words:
1. 'small': 1 documents, 1 hits
Anyone have an idea about why this is broken?
Thanks in advance
-Phill
EDIT
Ah, okay, got further. I didn't mention that I've been doing all of this testing on Mac OS. It looks like that may be my problem. I just compiled in 64-bit on Linux and it works great. There's also a clue, when I run the Sphinx command-line tools, that the compile didn't take:
My Mac (broken)
Sphinx 0.9.9-release (r2117)
Linux box (working)
Sphinx 0.9.9-id64-release (r2117)
So I guess the new question is: what's the trick to compiling for 64-bit keys on Mac OS?
Did you rebuild the index with the 64-bit indexer?
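For completeness, rebuilding with id64-enabled binaries would look roughly like this; the binary and config paths are assumptions based on the paths in the configuration above:

# Re-run the indexer built with --enable-id64 so the on-disk index stores 64-bit
# document ids, then rotate so the running searchd picks up the rebuilt index.
/usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf new_proof --rotate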