Elastic Architecture: Full Text Searching on 1 Million Files' Content - elasticsearch

Summary
I am trying to design an Elasticsearch index (or indices) that will provide a solid foundation for indexing 1,000,000+ files and full-text searching their contents. New files will be continuously added after the initial digitization process.
Use Case
Various file types (PDF, Outlook email, MP3, TXT, JPEG of handwritten material, etc.) need to be searchable by their contents and metadata. Users want to manually tag relationships between documents, e.g. Document A -> contains information about -> Document B. Users want to be able to see related/similar texts. Users want Named Entity Recognition on the text contents. The physical files are already stored on an external computer, waiting to be processed.
Implementation
File Content extraction pipeline using Apache Tika
NER using Spacy
Upload File Contents + NER Tags to Elastic
Eventually we would run our own search models to gain better search insights + data science.
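For concreteness, here is a minimal sketch of that extraction -> NER -> indexing flow, assuming the tika Python client, spaCy's small English model, and the official elasticsearch client; the index name, field names, and document structure are placeholders rather than a recommendation:

```python
# Hypothetical pipeline sketch: extract text with Tika, tag entities with spaCy,
# and index content + NER tags into Elasticsearch. Names are illustrative only.
from tika import parser                      # pip install tika
import spacy                                 # pip install spacy
from elasticsearch import Elasticsearch      # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
nlp = spacy.load("en_core_web_sm")           # swap in your own trained NER model here

def index_file(path: str, index: str = "documents") -> None:
    parsed = parser.from_file(path)          # returns {"content": ..., "metadata": ...}
    text = (parsed.get("content") or "").strip()
    entities = [{"text": ent.text, "label": ent.label_} for ent in nlp(text).ents]
    es.index(index=index, document={
        "path": path,
        "content": text,
        "metadata": parsed.get("metadata", {}),
        "entities": entities,                # NER tags stored alongside the text
        "related_docs": [],                  # placeholder for user-tagged relationships
    })

index_file("/data/archive/example.pdf")
```

Whether NER runs in this pass (before indexing) or later as an update to already-indexed documents mostly changes when the processing cost is paid; doing it in the same pass avoids re-reading the extracted text.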
How do I best store the extracted contents to fit the users' needs and have a scalable foundation? Is it better to run our trained Named Entity Recognition at initial indexing time, or after the extracted text has been uploaded to Elasticsearch?
Or does it make more sense to use one of the existing solutions below, so as not to reinvent the wheel?
https://github.com/dadoonet/fscrawler
https://github.com/deepset-ai/haystack
https://github.com/RD17/ambar

Instead of reinventing the wheel, I'd recommend using an existing solution such as Jina; there's a working example of PDF search using Jina. You can also search across different modalities (text, image, PDF, etc.) with it.

Related

Combine multiple VCF files into one large VCF file

I have a list of VCF files from specific ethnicities such as American Indian, Chinese, European, etc.
Under each ethnicity, I have around 100+ files.
Currently, I have computed VARIANT QC metrics such as call_rate, n_het, etc. for one file, as shown in the Hail tutorial.
However, now I would like to have one file for each ethnicity and then compute VARIANT_QC metrics.
I already referred to this post and this post, but I don't think they address my query.
How can I do this across all files under a specific ethnicity?
Can someone help me with this?
Is there any hail/python/R/other tools way to do this?
You could use Variant Transforms to achieve this goal. Variant Transforms is a tool for parsing and importing VCF files into BigQuery. It can also perform the reverse transform: exporting variants stored in BigQuery tables to a VCF file. So basically you need: multiple VCF files -> BigQuery -> single VCF file
Variant Transforms can easily handle multiple input files. It can also apply more complex logic to merge the same variants across multiple files into the same record. After your variants are all loaded into BigQuery, you can export them to a VCF file.
Note that Variant Transforms creates a separate table for each chromosome to optimize query costs. You can easily create a VCF file for each chromosome and then merge them together to create a single one.
You can reach out to Variant Transforms team if you need help with this task.
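If you would rather stay in Hail (per the "hail/python/R" part of the question), a minimal sketch could look like the following; the bucket paths and naming scheme are invented, and passing a list to hl.import_vcf assumes the files of one ethnicity share the same samples (e.g. they are split by chromosome or region). If each file instead contains different samples, the files would need to be joined on variants rather than simply concatenated.

```python
# Hypothetical Hail sketch: load all VCFs of one ethnicity as a single
# MatrixTable, compute variant QC, and export one merged VCF.
import hail as hl

hl.init()

# Assumed layout/naming for the ~100 files of one ethnicity.
paths = [f"gs://my-bucket/european/part-{i:03d}.vcf.gz" for i in range(100)]

# import_vcf accepts a list of paths when the files share the same sample set.
mt = hl.import_vcf(paths, reference_genome="GRCh38", force_bgz=True)
mt = hl.variant_qc(mt)          # adds call_rate, n_het, and related metrics

hl.export_vcf(mt, "gs://my-bucket/european_merged.vcf.bgz")
```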

Custom patterns for stream analytics blob storage

My question is about saving data out of Stream Analytics to Blob storage. In our system we are collecting clickstream data from many websites via Event Hubs. Then we do some light grouping and aggregation. After that we send the results to our Blob storage.
The problem is that we want to separate our results into many blob containers by ID for each website. Right now we can only do it with a date and time pattern like /logs/{date}/{time}, but we want /{websiteID}/{date}/{time}.
Is there any way of achieving this?
This is a duplicate question:
Stream Analytics: Dynamic output path based on message payload
Azure Stream Analytics -> how much control over path prefix do I really have?
The short version of the above is you can't do this in Stream Analytics. If you have too many target paths for multiple sinks to be feasible, your best bet is to stream to a single blob store sink and process the results with something other than ASA. Azure Functions, WebJobs or ADF tasks are a few possible solutions.
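As a sketch of that single-sink-plus-post-processing route, assuming the ASA job writes JSON lines that contain a websiteID field into one output container (the container names, path layout, and field name are all assumptions), a small script or Azure Function could re-partition the blobs roughly like this:

```python
# Hypothetical re-partitioning sketch using the azure-storage-blob SDK:
# read JSON-line blobs from the single ASA sink and rewrite the records
# under /{websiteID}/{date}/{time}/ paths. All names are illustrative.
import json
from collections import defaultdict
from azure.storage.blob import BlobServiceClient    # pip install azure-storage-blob

service = BlobServiceClient.from_connection_string("<connection-string>")
source = service.get_container_client("results")    # single ASA output container
target = service.get_container_client("bywebsite")  # assumed to already exist

for blob in source.list_blobs(name_starts_with="logs/"):
    text = source.download_blob(blob.name).readall().decode("utf-8")
    grouped = defaultdict(list)
    for line in filter(None, text.splitlines()):
        record = json.loads(line)
        grouped[record["websiteID"]].append(line)    # assumed payload field
    # blob.name looks like "logs/{date}/{time}/xyz.json"; keep the date/time part.
    suffix = blob.name.split("/", 1)[1]
    for website_id, lines in grouped.items():
        target.upload_blob(f"{website_id}/{suffix}", "\n".join(lines), overwrite=True)
```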
The problem is that we want to separate our results into many blob containers by ID for each website. Right now we can only do it with a date and time pattern like /logs/{date}/{time}, but we want /{websiteID}/{date}/{time}.
As the official document stream-analytics-define-outputs mentions about the Path Prefix Pattern of the Blob storage output:
The file path used to write your blobs within the specified container.
Within the path, you may choose to use one or more instances of the following 2 variables to specify the frequency that blobs are written:
{date}, {time}
Example 1: cluster1/logs/{date}/{time}
Example 2: cluster1/logs/{date}
Based on my understanding, you could create multiple blob output targets in a single Stream Analytics job, one for each of your websites, and in your SQL-like query you could filter the event data and route it to the corresponding output. For more details, you could refer to Common query patterns.

Large scale static file (csv, txt, etc.) archiving solution

I am new to large-scale data analytics and archiving, so I thought I would ask this question to see if I am looking at things the right way.
Current requirement:
I have a large number of static files in the filesystem: CSV, EML, TXT, JSON
I need to warehouse this data for archiving / legal reasons
I need to provide a unified search facility (this is the MAIN functionality)
Future requirement:
I need to enrich the data file with additional metadata
I need to do analytics on the data
I might need to ingest data from other sources from API etc.
I would like to come up with a relatively simple solution with the possibility that I can expand it later with additional parts without having to rewrite bits. Ideally I would like to keep each part as a simple service.
As search is currently the KEY functionality and I am experienced with Elasticsearch, I thought I would use ES for distributed search.
I have the following questions:
Should I copy the files from static storage to Hadoop?
Is there any virtue in keeping the data in HBase instead of individual files?
Is there a way that, once a file is added to Hadoop, I can trigger an event to index the file into Elasticsearch?
Is there perhaps a simpler way to monitor hundreds of folders for new files and push them to Elasticsearch?
I am sure I am overcomplicating this, as I am new to this field. Hence I would appreciate some ideas/directions I should explore to build something simple but future-proof.
Thanks for looking!
Regards,
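On the last two questions (triggering indexing when files arrive, and monitoring many folders), one simple option that skips Hadoop entirely is a small watcher process. A minimal sketch, assuming the watchdog and elasticsearch Python packages; the index name, watch path, and naive plain-text reading are placeholders, and real content extraction (e.g. Tika) and error handling would still be needed:

```python
# Hypothetical sketch: watch a directory tree for new files and index them
# into Elasticsearch as they appear.
import time
from pathlib import Path
from elasticsearch import Elasticsearch            # pip install elasticsearch
from watchdog.observers import Observer            # pip install watchdog
from watchdog.events import FileSystemEventHandler

es = Elasticsearch("http://localhost:9200")

class IndexNewFiles(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        path = Path(event.src_path)
        es.index(index="archive", document={
            "path": str(path),
            "filename": path.name,
            "content": path.read_text(errors="ignore"),  # naive: plain text only
        })

observer = Observer()
observer.schedule(IndexNewFiles(), path="/data/static-files", recursive=True)
observer.start()
try:
    while True:
        time.sleep(5)
finally:
    observer.stop()
    observer.join()
```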

What are binary types in Hadoop?

Hadoop - The Definitive Guide says
If you want to log binary types, plain text isn’t a suitable format.
My questions are: 1. Why not? 2. What are binary types?
and further:
Hadoop’s SequenceFile class fits the bill in this situation, providing a persistent data structure for binary key-value pairs. To use it as a logfile format, you would choose a key, such as a timestamp represented by a LongWritable, and the value is a Writable that represents the quantity being logged.
Why can't a text file be used, and why is a SequenceFile required?
On the same page, the book also says:
For some applications, you need a specialized data structure to hold your data. For doing MapReduce-based processing, putting each blob of binary data into its own file doesn’t scale, so Hadoop developed a number of higher-level containers for these situations.
E.g., assume that you are uploading images to Facebook and you have to remove duplicate images. You can't store an image in text format. What you can do is get the MD5SUM of the image file, and if that MD5SUM already exists in the system, simply discard the duplicate insertion. In your text file, you can simply record "Date:" and "Number of images uploaded". The image itself can be stored outside of HDFS, e.g. on a CDN or some other web server.
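A small illustration of that MD5-based de-duplication idea; the paths, the in-memory hash set, and the log line format are made up for the example:

```python
# Hypothetical sketch: hash each incoming image, discard duplicates, store the
# image outside HDFS, and keep only small text metadata as the "log".
import hashlib
from datetime import date
from pathlib import Path

seen_hashes = set()      # in practice this lookup would live in a database/index
uploaded = 0

def ingest_image(src: Path, store_dir: Path) -> bool:
    global uploaded
    data = src.read_bytes()
    digest = hashlib.md5(data).hexdigest()
    if digest in seen_hashes:
        return False                          # duplicate: discard the insertion
    seen_hashes.add(digest)
    (store_dir / f"{digest}{src.suffix}").write_bytes(data)  # CDN/web server stand-in
    uploaded += 1
    return True

for img in Path("/incoming").glob("*.jpg"):
    ingest_image(img, Path("/image-store"))

# The plain-text log carries only metadata, never the binary image itself.
print(f"Date: {date.today()}  Number of images uploaded: {uploaded}")
```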

Huge files in Hadoop: how to store metadata?

I have a use case to upload some terabytes of text files as sequence files on HDFS.
These text files have several layouts ranging from 32 to 62 columns (metadata).
What would be a good way to upload these files along with their metadata:
creating a key/value class per text file layout and using it to create and upload sequence files?
creating a SequenceFile.Metadata header in each file being uploaded as a sequence file individually?
Any inputs are appreciated !
Thanks
I prefer storing metadata with the data and then designing your application to be metadata-driven, as opposed to embedding metadata in the design or implementation of your application, which then means updates to metadata require updates to your app. Of course there are limits to how far you can take a metadata-driven application.
You can embed the metadata with the data, for example by using an encoding scheme like JSON, or you can keep the metadata alongside the data, e.g. by having records in the SeqFile specifically for describing metadata, perhaps using reserved tags for the keys so as to give the metadata its own namespace, separate from the namespace used by the keys of the actual data.
As for the recommendation of whether this should be packaged into separate Hadoop files, bear in mind that Hadoop can be instructed to split a file into splits (input for the map phase) via configuration settings. Thus even a single large SeqFile can be processed in parallel by several map tasks. The advantage of having a single HDFS file is that it more closely resembles the unit of containment of your original data.
As for the recommendation about key types (i.e. whether to use Text or a binary type), consider that the key will be compared against other keys. The more compact the key, the faster the comparison. Thus if you can store a dense version of the key, that would be preferable. Likewise, if you can structure the key layout so that the first bytes are typically NOT the same, it will also help performance. So, for instance, serializing a Java class as the key would not be recommended, because the text stream begins with the package name of your class, which is likely to be the same as that of every other class, and thus key, in the file.
If you want the data and its metadata bundled together, then the Avro format is the appropriate one. It also allows schema evolution.
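For illustration, a minimal sketch of that data-plus-metadata bundling with the fastavro package; the schema and field names are invented for the example. Because the schema travels in the file header, adding an optional field later is the usual backward-compatible evolution path:

```python
# Hypothetical sketch: the Avro schema (the metadata) is embedded in the file,
# so readers need no external layout description.
from fastavro import writer, reader, parse_schema   # pip install fastavro

schema = parse_schema({
    "name": "Record", "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "payload", "type": "string"},
        # An optional field added later is a backward-compatible schema evolution.
        {"name": "source", "type": ["null", "string"], "default": None},
    ],
})

records = [
    {"id": 1, "payload": "first row", "source": "upload-2024"},
    {"id": 2, "payload": "second row", "source": None},
]

with open("data.avro", "wb") as out:
    writer(out, schema, records)

with open("data.avro", "rb") as fo:
    for rec in reader(fo):          # schema is read from the file header
        print(rec)
```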
The simplest thing to do is to make the keys and values of the SequenceFiles Text. Pick a meaningful field from your data to be the key; the data itself is the value, as Text. SequenceFiles are designed for storing key/value pairs; if that's not what your data is, then don't use a SequenceFile. You could just upload unprocessed text files and input those to Hadoop.
For best performance, do not make each file terabytes in size. The map stage of Hadoop runs at least one task per input file. You want to have more files than you have CPU cores in your Hadoop cluster; otherwise you will have one CPU doing 1 TB of work and a lot of idle CPUs. A good file size is probably 64-128 MB, but for best results you should measure this yourself.
