What is the best way to bulk index (around 40K files of type .docx) using ingest-attachment? - elasticsearch

I am fairly familiar with the ELK stack and am currently using Elasticsearch 6.6. Our use case is content search for about 40K .docx files
(uploaded by portfolio managers as research reports.
The maximum allowed file size is 10 MB, but most files are only a few KB).
I have used the ingest-attachment plugin to index sample test files, and I am also able to search the content using Kibana. For example:
POST /attachment_test/my_type/_search?pretty=true
{
  "query": {
    "match": {
      "attachment.content": "JP Morgan"
    }
  }
}
This returns the expected results.
My questions:
1. Using the ingest plugin, we need to push the data to it ourselves. I am using VS 2017 and the Elasticsearch NEST dll, which means I have to programmatically read the 40K documents and push them to ES using NEST commands. Is that right?
2. I have gone through the FSCrawler project and know that it can achieve the purpose, but I am keeping it as my last resort.
3. If I were to use approach 1 (code), is there a bulk upload API available for posting a number of attachments to ES together (in batches)?

Finally, I uploaded the 40K files into the Elasticsearch index using the following C# code:
private static void PopulateIndex(ElasticClient client)
{
    var directory = System.Configuration.ConfigurationManager.AppSettings["CallReportPath"].ToString();
    var callReportsCollection = Directory.GetFiles(directory, "*.doc"); // this will fetch both .doc and .docx
    ConcurrentBag<string> reportsBag = new ConcurrentBag<string>(callReportsCollection);
    int i = 0;
    var callReportElasticDataSet = new DLCallReportSearch().GetCallReportDetailsForElastic();
    try
    {
        Parallel.ForEach(reportsBag, callReport =>
        {
            var base64File = Convert.ToBase64String(File.ReadAllBytes(callReport));
            var fileSavedName = callReport.Replace(directory, "");
            // escape single quotes in the file name for the DataTable filter expression
            var rows = callReportElasticDataSet.Select("CALL_SAVE_FILE like '%" + fileSavedName.Replace("'", "''") + "'");
            if (rows != null && rows.Count() > 0)
            {
                var row = rows.FirstOrDefault();
                var id = Interlocked.Increment(ref i); // thread-safe counter inside Parallel.ForEach
                client.Index(new Document
                {
                    Id = id,
                    DocId = Convert.ToInt32(row["CALL_ID"].ToString()),
                    Path = row["CALL_SAVE_FILE"].ToString().Replace(CallReportPath, ""),
                    Title = row["CALL_FILE"].ToString().Replace(CallReportPath, ""),
                    Author = row["USER_NAME"].ToString(),
                    DateOfMeeting = string.IsNullOrEmpty(row["CALL_DT"].ToString()) ? (DateTime?)null : Convert.ToDateTime(row["CALL_DT"].ToString()),
                    Location = row["CALL_LOCATION"].ToString(),
                    UploadDate = string.IsNullOrEmpty(row["CALL_REPORT_DT"].ToString()) ? (DateTime?)null : Convert.ToDateTime(row["CALL_REPORT_DT"].ToString()),
                    CompanyName = row["COMP_NAME"].ToString(),
                    CompanyId = Convert.ToInt32(row["COMP_ID"].ToString()),
                    Country = row["COU_NAME"].ToString(),
                    CountryCode = row["COU_CD"].ToString(),
                    RegionCode = row["REGION_CODE"].ToString(),
                    RegionName = row["REGION_NAME"].ToString(),
                    SectorCode = row["SECTOR_CD"].ToString(),
                    SectorName = row["SECTOR_NAME"].ToString(),
                    Content = base64File
                }, p => p.Pipeline("attachments"));
            }
        });
    }
    catch (Exception)
    {
        throw; // rethrow without losing the original stack trace
    }
}
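On question 3 (a bulk API): instead of calling client.Index once per file, NEST can send the documents in batches through the bulk API, and the bulk request also accepts an ingest pipeline. The following is only a rough sketch of that idea, not the code actually used above; it assumes the "attachments" pipeline already exists and reuses the same Document class:
private static void BulkPopulateIndex(ElasticClient client, IReadOnlyList<Document> documents, int batchSize = 100)
{
    for (int start = 0; start < documents.Count; start += batchSize)
    {
        var batch = documents.Skip(start).Take(batchSize).ToList();

        // One bulk request per batch, routed through the "attachments" ingest pipeline.
        var response = client.Bulk(b => b
            .Pipeline("attachments")
            .IndexMany(batch));

        if (response.Errors)
        {
            foreach (var itemWithError in response.ItemsWithErrors)
            {
                Console.WriteLine("Failed to index document {0}: {1}", itemWithError.Id, itemWithError.Error);
            }
        }
    }
}
Because every document carries a base64-encoded file, keep the batch size small enough that a single bulk request stays well below Elasticsearch's http.max_content_length limit (100 MB by default).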

Related

EWS The server cannot service this request right now

I am seeing errors while exporting email from an Office 365 account using the EWS Managed API: "The server cannot service this request right now. Try again later." Why does that error occur and what can be done about it?
I am using the following code for that work:
_GetEmail = (EmailMessage)item;
bool isread = _GetEmail.IsRead;
sub = _GetEmail.Subject;
fold = folder.DisplayName;
historicalDate = _GetEmail.DateTimeSent.Subtract(folder.Service.TimeZone.GetUtcOffset(_GetEmail.DateTimeSent));
props = new PropertySet(EmailMessageSchema.MimeContent);
var email = EmailMessage.Bind(_source, item.Id, props);
bytes = new byte[email.MimeContent.Content.Length];
fs = new MemoryStream(bytes, 0, email.MimeContent.Content.Length, true);
fs.Write(email.MimeContent.Content, 0, email.MimeContent.Content.Length);
Demail = new EmailMessage(_destination);
Demail.MimeContent = new MimeContent("UTF-8", bytes);
// 'SetExtendedProperty' used to maintain historical date of items
Demail.SetExtendedProperty(new ExtendedPropertyDefinition(57, MapiPropertyType.SystemTime), historicalDate);
// PR_MESSAGE_DELIVERY_TIME
Demail.SetExtendedProperty(new ExtendedPropertyDefinition(3590, MapiPropertyType.SystemTime), historicalDate);
if (isread == false)
{
    Demail.IsRead = isread;
}
if (_source.RequestedServerVersion == flagVersion && _destination.RequestedServerVersion == flagVersion)
{
    Demail.Flag = _GetEmail.Flag;
}
_lstdestmail.Add(Demail);
_objtask = new TaskStatu();
_objtask.TaskId = _taskid;
_objtask.SubTaskId = subtaskid;
_objtask.FolderId = Convert.ToInt64(folderId);
_objtask.SourceItemId = Convert.ToString(_GetEmail.InternetMessageId.ToString());
_objtask.DestinationEmail = Convert.ToString(_fromEmail);
_objtask.CreatedOn = DateTime.UtcNow;
_objtask.IsSubFolder = false;
_objtask.FolderName = fold;
_objdbcontext.TaskStatus.Add(_objtask);
try
{
    if (counter == countGroup)
    {
        Demails = new EmailMessage(_destination);
        Demails.Service.CreateItems(_lstdestmail, _destinationFolder.Id, MessageDisposition.SaveOnly, SendInvitationsMode.SendToNone);
        _objdbcontext.SaveChanges();
        counter = 0;
        _lstdestmail.Clear();
    }
}
catch (Exception ex)
{
    ClouldErrorLog.CreateError(_taskid, subtaskid, ex.Message + GetLineNumber(ex, _taskid, subtaskid), CreateInnerException(sub, fold, historicalDate));
    counter = 0;
    _lstdestmail.Clear();
    continue;
}
This error occurs only when exporting to Office 365 accounts; it works fine with Outlook 2010, 2013, 2016, etc.
Usually this happens when you exceed the EWS throttling limits in Exchange, as explained here.
Make sure you are familiar with the throttling policies and that your code complies with them.
You can find the throttling policies using Get-ThrottlingPolicy if you have access to the server.
One way to solve the throttling issue you are experiencing is to implement paging instead of requesting all items in one go; you can refer to this link.
For instance:
using Microsoft.Exchange.WebServices.Data;

static void PageSearchItems(ExchangeService service, WellKnownFolderName folder)
{
    int pageSize = 5;
    int offset = 0;
    // Request one more item than your actual pageSize.
    // This will be used to detect a change to the result
    // set while paging.
    ItemView view = new ItemView(pageSize + 1, offset);
    view.PropertySet = new PropertySet(ItemSchema.Subject);
    view.OrderBy.Add(ItemSchema.DateTimeReceived, SortDirection.Descending);
    view.Traversal = ItemTraversal.Shallow;
    bool moreItems = true;
    ItemId anchorId = null;
    while (moreItems)
    {
        try
        {
            FindItemsResults<Item> results = service.FindItems(folder, view);
            moreItems = results.MoreAvailable;
            if (moreItems && anchorId != null)
            {
                // Check the first result to make sure it matches
                // the last result (anchor) from the previous page.
                // If it doesn't, that means that something was added
                // or deleted since you started the search.
                if (results.Items.First<Item>().Id != anchorId)
                {
                    Console.WriteLine("The collection has changed while paging. Some results may be missed.");
                }
            }
            if (moreItems)
                view.Offset += pageSize;
            anchorId = results.Items.Last<Item>().Id;
            // Because you're including an additional item on the end of your results
            // as an anchor, you don't want to display it.
            // Set the number to loop as the smaller value between
            // the number of items in the collection and the page size.
            int displayCount = results.Items.Count > pageSize ? pageSize : results.Items.Count;
            for (int i = 0; i < displayCount; i++)
            {
                Item item = results.Items[i];
                Console.WriteLine("Subject: {0}", item.Subject);
                Console.WriteLine("Id: {0}\n", item.Id.ToString());
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("Exception while paging results: {0}", ex.Message);
        }
    }
}

NEST Elasticsearch Reindex examples

My objective is to reindex an index with 10 million documents in order to change field mappings and facilitate significant terms analysis.
My problem is that I am having trouble using the NEST library to perform a re-index, and the documentation is (very) limited. If possible I need an example of the following in use:
http://nest.azurewebsites.net/nest/search/scroll.html
http://nest.azurewebsites.net/nest/core/bulk.html
NEST provides a nice Reindex method you can use, although the documentation is lacking. I've used it in a very rough-and-ready fashion with this ad-hoc WinForms code.
private ElasticClient client;
private double count;

private void reindex_Completed()
{
    MessageBox.Show("Done!");
}

private void reindex_Next(IReindexResponse<object> obj)
{
    count += obj.BulkResponse.Items.Count();
    var progress = 100 * count / (double)obj.SearchResponse.Total;
    progressBar1.Value = (int)progress;
}

private void reindex_Error(Exception ex)
{
    MessageBox.Show(ex.ToString());
}

private void button1_Click(object sender, EventArgs e)
{
    count = 0;
    var reindex = client.Reindex<object>(r => r.FromIndex(fromIndex.Text).NewIndexName(toIndex.Text).Scroll("10s"));
    var o = new ReindexObserver<object>(onError: reindex_Error, onNext: reindex_Next, completed: reindex_Completed);
    reindex.Subscribe(o);
}
And I've just found the blog post that showed me how to do it: http://thomasardal.com/elasticsearch-migrations-with-c-and-nest/
Unfortunately the NEST implementation is not quite what I expected. In my opinion it's a bit over-engineered for possibly the most common use case: a lot of people just want to update their mappings with zero downtime.
In my case I had already taken care of creating the index with all its settings and mappings, but NEST insists on creating a new index when reindexing - that, among too many other things.
I found it much less complicated to implement it directly, since NEST already has Search, Scroll, and Bulk methods (this is adapted from NEST's implementation):
// Assuming you have already created and set up the index yourself
public void Reindex(ElasticClient client, string aliasName, string currentIndexName, string nextIndexName)
{
    Console.WriteLine("Reindexing documents to new index...");
    var searchResult = client.Search<object>(s => s.Index(currentIndexName).AllTypes().From(0).Size(100).Query(q => q.MatchAll()).SearchType(SearchType.Scan).Scroll("2m"));
    if (searchResult.Total <= 0)
    {
        Console.WriteLine("Existing index has no documents, nothing to reindex.");
    }
    else
    {
        var page = 0;
        IBulkResponse bulkResponse = null;
        do
        {
            var result = searchResult;
            searchResult = client.Scroll<object>(s => s.Scroll("2m").ScrollId(result.ScrollId));
            if (searchResult.Documents != null && searchResult.Documents.Any())
            {
                searchResult.ThrowOnError("reindex scroll " + page);
                bulkResponse = client.Bulk(b =>
                {
                    foreach (var hit in searchResult.Hits)
                    {
                        b.Index<object>(bi => bi.Document(hit.Source).Type(hit.Type).Index(nextIndexName).Id(hit.Id));
                    }
                    return b;
                }).ThrowOnError("reindex page " + page);
                Console.WriteLine("Reindexing progress: " + (page + 1) * 100);
            }
            ++page;
        }
        while (searchResult.IsValid && bulkResponse != null && bulkResponse.IsValid && searchResult.Documents != null && searchResult.Documents.Any());
        Console.WriteLine("Reindexing complete!");
    }
    Console.WriteLine("Updating alias to point to new index...");
    client.Alias(a => a
        .Add(aa => aa.Alias(aliasName).Index(nextIndexName))
        .Remove(aa => aa.Alias(aliasName).Index(currentIndexName)));
    // TODO: Don't forget to delete the old index if you want
}
And the ThrowOnError extension method in case you want it:
public static T ThrowOnError<T>(this T response, string actionDescription = null) where T : IResponse
{
    if (!response.IsValid)
    {
        throw new CustomExceptionOfYourChoice(actionDescription == null ? string.Empty : "Failed to " + actionDescription + ": " + response.ServerError.Error);
    }
    return response;
}
I second Ben Wilde's answer above. Better to have full control over index creation and the re-index process.
What's missing from Ben's code is support for parent/child relationship. Here is my code to fix that:
Replace the following lines:
foreach (var hit in searchResult.Hits)
{
    b.Index<object>(bi => bi.Document(hit.Source).Type(hit.Type).Index(nextIndexName).Id(hit.Id));
}
With this:
foreach (var hit in searchResult.Hits)
{
    var jo = hit.Source as JObject;
    JToken jt;
    if (jo != null && jo.TryGetValue("parentId", out jt))
    {
        // Document is a child document => add parent reference
        string parentId = (string)jt;
        b.Index<object>(bi => bi.Document(hit.Source).Type(hit.Type).Index(nextIndexName).Id(hit.Id).Parent(parentId));
    }
    else
    {
        b.Index<object>(bi => bi.Document(hit.Source).Type(hit.Type).Index(nextIndexName).Id(hit.Id));
    }
}

Dart PowerSNMP GetTable does not return any record

I am using Dart PowerSNMP for .NET.
I am trying to query a table using GetTable(), but it does not work for me.
The C# code below does not return any rows:
const string address = "xxx.xxx.xx.x";
using (var mgr = new Manager())
{
    var slave = new ManagerSlave(mgr);
    slave.Socket.ReceiveTimeout = 13000;
    try
    {
        // Retrieve table using GetNext requests
        Variable[,] table = slave.GetTable("1.3.6.1.4.1.14823.2.2.1.1.1.9",
            SnmpVersion.Three,
            null,
            new Security()
            {
                AuthenticationPassword = "mypassword1",
                AuthenticationProtocol = AuthenticationProtocol.Md5,
                PrivacyPassword = "mypassword2",
                PrivacyProtocol = PrivacyProtocol.Des
            },
            new IPEndPoint(IPAddress.Parse(address), 161),
            0);
    }
    catch (Exception ez)
    {
    }
}
This is supposed to return a set of records for the given OID, but it does not return anything. When I use a MIB browser, I see that a GetBulk operation fetches all the records.
So what is wrong with GetTable() here?

C# console App compare records to database

I am parsing a flat file line by line and inserting the records into the database, but I want to add an additional step (actually, two additional steps).
First, I only want recall records for specific car manufacturers. I have a database table called AutoMake that lists all the makes I want to include, and I need to compare each record against that table to make sure it is for one of those makes.
Then I need to do a second check to make sure the record is not already in my database.
This is a console app and I am using Entity Framework. Here is my code; I am driving myself crazy trying to write and rewrite it to include the checks, but I'm just not getting it.
Not that it really matters, because if someone can point me in the right direction I can take it from there, but tokens[2] is MAKETXT and tokens[23] is RCL_CMPT_ID; RCL_CMPT_ID can be used to verify whether a record is already in the database since it is a unique value.
public static void ParseTSV(string location)
{
    Console.WriteLine("Parsing.....");
    using (var reader = new StreamReader(location))
    {
        var lines = reader.ReadToEnd().Split(new char[] { '\n' });
        if (lines.Length > 0)
        {
            foreach (string line in lines)
            {
                if (string.IsNullOrWhiteSpace(line))
                {
                    continue;
                }
                var tokens = line.Trim().Split(new char[] { '\t' });
                var recalls = new Recalls();
                recalls.RECORD_ID = tokens[0];
                recalls.CAMPNO = tokens[1];
                recalls.MAKETXT = tokens[2];
                recalls.MODELTXT = tokens[3];
                recalls.YEARTXT = tokens[4];
                recalls.MFGCAMPNO = tokens[5];
                recalls.COMPNAME = tokens[6];
                recalls.MFGNAME = tokens[7];
                recalls.BGMAN = tokens[8];
                recalls.ENDMAN = tokens[9];
                recalls.RCLTYPECD = tokens[10];
                recalls.POTAFF = tokens[11];
                recalls.ODATE = tokens[12];
                recalls.INFLUENCED_BY = tokens[13];
                recalls.MFGTXT = tokens[14];
                recalls.RCDATE = tokens[15];
                recalls.DATEA = tokens[16];
                recalls.RPNO = tokens[17];
                recalls.FMVSS = tokens[18];
                recalls.DESC_DEFECT = tokens[19];
                recalls.CONEQUENCE_DEFECT = tokens[20];
                recalls.CORRECTIVE_ACTION = tokens[21];
                recalls.NOTES = tokens[22];
                recalls.RCL_CMPT_ID = tokens[23];
                string connectionString = GetConnectionString();
                using (SqlConnection connection = new SqlConnection(connectionString))
                {
                    SqlCommand cmdIns = new SqlCommand(GetInsertSqlCust(recalls), connection);
                    connection.Open();
                    cmdIns.ExecuteNonQuery();
                    connection.Close();
                    cmdIns.Dispose();
                    cmdIns = null;
                }
            }
        }
    }
}
1. Get the ID to check against.
2. Get the table the search needs to be done on, for example:
string strExpression = "";
DataTable tdGeneric = dal.getsometable();
3. Check whether the auto make exists:
if (tdGeneric != null && tdGeneric.Rows.Count > 0)
{
    strExpression = "tablecolumnsname = '" + recalls.MAKETXT + "' ";
    tdGeneric.DefaultView.RowFilter = strExpression;
    tdGeneric = tdGeneric.DefaultView.ToTable();
    if (tdGeneric.Rows.Count > 0)
    {
        // make exists
    }
    else
    {
        // make doesn't exist
    }
}
else
{
    // make doesn't exist; skip that text file's record
}
4. If the make exists, check whether the record already exists in the table. Get the original table and search it for the specific ID (RCL_CMPT_ID in your case):
DataTable tdGeneric2 = dal.getsometable();
if (tdGeneric2 != null && tdGeneric2.Rows.Count > 0)
{
    strExpression = "tablecolumnsname = '" + recalls.RCL_CMPT_ID + "' ";
    tdGeneric2.DefaultView.RowFilter = strExpression;
    tdGeneric2 = tdGeneric2.DefaultView.ToTable();
    if (tdGeneric2.Rows.Count > 0)
    {
        // record exists
    }
    else
    {
        // record doesn't exist
    }
}
else
{
    // record doesn't exist; insert the record, or set a flag to insert it
}
You can take advantage of caching. Fetch all the makes before reading the file and do a lookup in a List or Dictionary of existing AutoMakes to check whether the AutoMake already exists in the database or is a new one. If the AutoMake is new, insert the record into the database and also add the make to the List/Dictionary. If the AutoMake already exists, skip the line and move to the next one.
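For illustration, here is a minimal sketch of that caching approach, adapted to the question's two checks (the DbContext and property names such as RecallsDbContext, AutoMakes, and MakeName are hypothetical):
using System;
using System.Collections.Generic;
using System.Linq;

public static class RecallLookups
{
    public static HashSet<string> LoadAllowedMakes(RecallsDbContext db)
    {
        // One round trip: cache every make we are willing to import.
        return new HashSet<string>(db.AutoMakes.Select(m => m.MakeName), StringComparer.OrdinalIgnoreCase);
    }

    public static HashSet<string> LoadExistingComponentIds(RecallsDbContext db)
    {
        // RCL_CMPT_ID is unique, so it identifies records already imported.
        return new HashSet<string>(db.Recalls.Select(r => r.RCL_CMPT_ID));
    }

    public static bool ShouldInsert(string[] tokens, HashSet<string> allowedMakes, HashSet<string> existingIds)
    {
        var make = tokens[2];          // MAKETXT
        var componentId = tokens[23];  // RCL_CMPT_ID
        return allowedMakes.Contains(make) && !existingIds.Contains(componentId);
    }
}
Inside ParseTSV you would build the two sets once before the foreach loop, call ShouldInsert(tokens, allowedMakes, existingIds) before creating the Recalls object (and continue if it returns false), and add tokens[23] to existingIds after a successful insert so duplicates later in the same file are skipped as well.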

lucene.net search multiple fields with one value AND other field with another value

I have a Lucene document with various fields: Name, BriefData, FullData, ParentIDs (comma-delimited string), ProductType, Experience.
I have a search form with a text box, a dropdown of parents, a dropdown of product types, and a dropdown of experience.
If I search from the text box I get the results I should. If I search from any of the dropdowns (or all of them) I get the results I want. But if I use the dropdowns AND the text box, I get all results, as if it were searching textbox OR dropdowns. What I want is textbox AND dropdowns.
So, my search is built something like this:
if (string.IsNullOrWhiteSpace(searchTerm))
{
    searchTerm = "";
    if (!string.IsNullOrWhiteSpace(Request.QueryString["textbox"]))
    {
        string tester = Request.QueryString["query"];
        searchTerm += tester;
    }
    if (!string.IsNullOrWhiteSpace(Request.QueryString["parent"]))
    {
        searchTerm += searchTerm.Length > 0 ? " " : "";
        searchTerm += "+ParentIDs:" + Request.QueryString["parent"];
    }
    if (!string.IsNullOrWhiteSpace(Request.QueryString["product"]))
    {
        ProductTypes pt = db.ProductTypes.Find(int.Parse(Request.QueryString["product"]));
        if (pt != null)
        {
            searchTerm += searchTerm.Length > 0 ? " " : "";
            searchTerm += "+ProductType:" + pt.TypeName;
        }
    }
    if (!string.IsNullOrWhiteSpace(Request.QueryString["experience"]))
    {
        searchTerm += searchTerm.Length > 0 ? " " : "";
        searchTerm += "+Experience:" + Request.QueryString["experience"];
    }
}
if (!Directory.Exists(Helper.LuceneSearch._luceneDir))
    Directory.CreateDirectory(Helper.LuceneSearch._luceneDir);
_searchResults = string.IsNullOrEmpty(searchField)
    ? Helper.LuceneSearch.Search(searchTerm).Distinct()
    : Helper.LuceneSearch.Search(searchTerm, searchField).Distinct();
return View(_searchResults.Distinct());
If I search with just the textbox and the parent dropdown, I get a searchTerm of "north +ParentIDs:62".
What I want is the search to ONLY return results with a parent of 62 AND (Name OR BriefData OR FullData of "north").
I have tried creating a searchTerm of "+(Name:north BriefData:north FullData:north) +ParentIDs:62" and "Name:north BriefData:north FullData:north +ParentIDs:62". The first returns no results and the second returns the same as just searching +ParentIDs:62.
I think the logic behind this is pretty simple. However, I have no idea what it is that I need to write in code.
Please help. :)
Thanks to JF Beaulac giving me cause to look at the Lucene.Net code I had included (Helper.LuceneSearch.Search(searchTerm).Distinct()), I rewrote my search to essentially stop using that helper and instead duplicate the relevant parts of it.
I did this by using the MultiFieldQueryParser for, oddly enough, the multi-field search I wanted. I then used TermQuery for the single-field queries. These were all added to a BooleanQuery, and my search was executed against that BooleanQuery.
var hits_limit = 1000;
var analyzer = new StandardAnalyzer(Version.LUCENE_29);
BooleanQuery bq = new BooleanQuery();
if (string.IsNullOrWhiteSpace(searchTerm))
{
    searchTerm = "";
    if (!string.IsNullOrWhiteSpace(Request.QueryString["textbox"]))
    {
        string tester = Request.QueryString["textbox"];
        var parser = new MultiFieldQueryParser(Version.LUCENE_29, new[] { "Name", "BriefData", "FullData" }, analyzer);
        var query = Helper.LuceneSearch.parseQuery(tester.Replace("*", "").Replace("?", ""), parser);
        bq.Add(query, BooleanClause.Occur.MUST);
    }
    if (!string.IsNullOrWhiteSpace(Request.QueryString["parent"]))
    {
        bq.Add(new TermQuery(new Term("ParentIDs", Request.QueryString["parent"])), BooleanClause.Occur.MUST);
    }
    if (!string.IsNullOrWhiteSpace(Request.QueryString["product"]))
    {
        ProductTypes pt = db.ProductTypes.Find(int.Parse(Request.QueryString["product"]));
        if (pt != null)
        {
            bq.Add(new TermQuery(new Term("ProductType", pt.TypeName)), BooleanClause.Occur.MUST);
        }
    }
    if (!string.IsNullOrWhiteSpace(Request.QueryString["experience"]))
    {
        bq.Add(new TermQuery(new Term("Experience", Request.QueryString["experience"])), BooleanClause.Occur.MUST);
    }
}
if (!System.IO.Directory.Exists(Helper.LuceneSearch._luceneDir))
    System.IO.Directory.CreateDirectory(Helper.LuceneSearch._luceneDir);
var searcher = new IndexSearcher(Helper.LuceneSearch._directory, false);
var hits = searcher.Search(bq, null, hits_limit, Sort.RELEVANCE).ScoreDocs;
var results = Helper.LuceneSearch._mapLuceneToDataList(hits, searcher).Distinct();
analyzer.Close();
searcher.Close();
searcher.Dispose();
return View(results);
It should be noted that to get the product and experience fields to work I had to set them to "Field.Index.NOT_ANALYZED" when adding them to the index. I'm guessing this was because they would only ever have a single value per document. The other searched fields are "Field.Index.ANALYZED".
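For reference, here is a minimal sketch of how those fields might be added at index time; the indexing code is not shown in the answer, so the method and parameter names below are hypothetical:
// Sketch only: the mapping method and its parameters are assumptions, not the poster's code.
private static Document MapToLuceneDocument(string name, string briefData, string fullData,
                                            string parentIds, string productType, string experience)
{
    var doc = new Document();
    // Free-text fields are analyzed so MultiFieldQueryParser can match individual words.
    doc.Add(new Field("Name", name, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("BriefData", briefData, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("FullData", fullData, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("ParentIDs", parentIds, Field.Store.YES, Field.Index.ANALYZED));
    // Single-value dropdown fields are NOT_ANALYZED so a TermQuery matches the exact stored value.
    doc.Add(new Field("ProductType", productType, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("Experience", experience, Field.Store.YES, Field.Index.NOT_ANALYZED));
    return doc;
}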
