I'm starting to explore using Elasticsearch to help analyze engineering data produced in a manufacturing facility. One of the key metrics we analyze is the First Pass Yield (FPY) of any given process. So imagine I had some test data like the following:
+------+---------+-----------+-----------+
| Item | Process | Pass/Fail | Timestamp |
+------+---------+-----------+-----------+
| A    |    1    |   Fail    |     1     | <-- First pass failure
| A    |    1    |   Pass    |     2     |
| A    |    2    |   Pass    |     3     |
| A    |    3    |   Fail    |     4     | <-- First pass failure
| A    |    3    |   Fail    |     5     |
| A    |    3    |   Pass    |     6     |
| A    |    4    |   Pass    |     7     |
+------+---------+-----------+-----------+
What I'd like to get out of this is the ability to query this index/type and determine what the first pass yield is by process. So conceptually I want to count the following in some time period using a set of filters:
How many unique items went through a given process step
How many of those items passed on their first attempt at a process
With a traditional RDBMS I can do this easily with subqueries to pull out and combine these counts. I'm very new to ES, so I'm not sure how to query the process data to count how many failures occurred the first time an item went through each process.
My real end goal is to include this on a Kibana dashboard so my customers can quickly analyze the FPY data for different processes over various time periods. I'm not there yet, but I believe Kibana will let me supply a raw JSON query if that's what this requires today.
Is this possible with Elasticsearch, or am I trying to use the wrong tool for the job here?
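To make the intent concrete, here is a minimal Python sketch of the two counts I'm after (this is just illustrative logic, not Elasticsearch API code; the function and record layout are made up for the example):

```python
from collections import defaultdict

# Sample test records mirroring the table above: (item, process, result, timestamp)
records = [
    ("A", 1, "Fail", 1),
    ("A", 1, "Pass", 2),
    ("A", 2, "Pass", 3),
    ("A", 3, "Fail", 4),
    ("A", 3, "Fail", 5),
    ("A", 3, "Pass", 6),
    ("A", 4, "Pass", 7),
]

def first_pass_yield(records):
    """Return {process: (unique_items, first_pass_passes, fpy)}."""
    # Keep only the earliest attempt per (item, process) pair.
    first_attempt = {}  # (item, process) -> (timestamp, result)
    for item, process, result, ts in records:
        key = (item, process)
        if key not in first_attempt or ts < first_attempt[key][0]:
            first_attempt[key] = (ts, result)

    # Count unique items and first-attempt passes per process.
    by_process = defaultdict(lambda: [0, 0])
    for (item, process), (ts, result) in first_attempt.items():
        by_process[process][0] += 1
        if result == "Pass":
            by_process[process][1] += 1

    return {p: (n, ok, ok / n) for p, (n, ok) in by_process.items()}

print(first_pass_yield(records))
# e.g. process 1: 1 item, 0 first-pass passes -> FPY 0.0; process 2 -> FPY 1.0
```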
When backing up a database in CockroachDB the number of rows/index_entries/bytes written are included in the output:
       job_id        |  status   | fraction_completed |  rows   | index_entries |  bytes
---------------------+-----------+--------------------+---------+---------------+---------
  903515943941406094 | succeeded |                  1 |  100000 |        200000 | 4194304
When restoring the same backup, the same metrics are reported:
       job_id        |  status   | fraction_completed |  rows   | index_entries |  bytes
---------------------+-----------+--------------------+---------+---------------+---------
  803515943941406094 | succeeded |                  1 |   99999 |        199999 | 4194200
What causes the difference between the two and is all my data restored?
A couple of things impact the metrics you are observing. The backup phase may copy certain system tables for metadata, and those are not directly restored from the backup image. The configuration from these system tables is applied to your restored database, but it does not count toward the reported rows / index_entries metrics.
So a relatively small delta between the two values in backup/restore scenarios is not out of the ordinary.
I do research in the field of Health-PM and am facing unstructured big data that needs a preprocessing phase to convert it into a suitable event log.
From googling, I understand that no ProM plug-in, stand-alone program, or script has been developed specifically for this task, except Celonis, which claims to have developed an event log converter. I'm also writing an event log generator for my specific case study.
I just want to know: is there any business solution, case study, or article that has investigated this issue?
Thanks.
Soureh
What exactly do you mean by unstructured? Is it a badly structured table like the example you provided, or data that is not structured at all (e.g. a hard disk full of files)?
In the first case, Celonis indeed provides an option to extract events from tables using Vertica SQL. In their free SNAP environment you can learn how to do that.
In the latter case, I guess that at least semi-structured data is needed to extract events at scale; otherwise your script has no clue where to look.
Good question! Many process mining papers mention that most existing information systems are PAISs (process-aware information systems) and hence qualify for process mining. This is true, BUT it does not mean you can get the data out of the box!
What's the solution? You may transform the existing data (typically from a relational database of your business solution, e.g., an ERP or HIS system) into an event log that process mining can understand.
It works like this: you look into the table containing, e.g., patient registration data. You need the patient ID of this table and the timestamp of registration for each ID. You create an empty table for your event log, typically called "Activity_Table". You consider giving a name to each activity depending on the business context. In our example "Patient Registration" would be a sound name. You insert all the patient IDs with their respective timestamp into the Activity_Table followed by the same activity name for all rows, i.e., "Patient Registration". The result looks like this:
|Patient-ID | Activity | timestamp |
|:----------|:--------------------:| -------------------:|
| 111 |"Patient Registration"| 2021.06.01 14:33:49 |
| 112 |"Patient Registration"| 2021.06.18 10:03:21 |
| 113 |"Patient Registration"| 2021.07.01 01:20:00 |
| ... | | |
Congrats! You have an event log with one activity. The rest works just the same: you create the same kind of table for every important action that has a timestamp in your database, e.g., "Diagnose finished", "Lab test requested", "Treatment A finished".
|Patient-ID | Activity | timestamp |
|:----------|:-----------------:| -------------------:|
| 111 |"Diagnose finished"| 2021.06.21 18:03:19 |
| 112 |"Diagnose finished"| 2021.07.02 01:22:00 |
| 113 |"Diagnose finished"| 2021.07.01 01:20:00 |
| ... | | |
Then you UNION all these mini tables and sort the result by Patient-ID and then by timestamp:
|Patient-ID | Activity | timestamp |
|:----------|:--------------------:| -------------------:|
| 111 |"Patient Registration"| 2021.06.01 14:33:49 |
| 111 |"Diagnose finished" | 2021.06.21 18:03:19 |
| 112 |"Patient Registration"| 2021.06.18 10:03:21 |
| 112 |"Diagnose finished" | 2021.07.02 01:22:00 |
| 113 |"Patient Registration"| 2021.07.01 01:20:00 |
| 113 |"Diagnose finished" | 2021.07.01 01:20:00 |
| ... | | |
If you look closely, the last two rows have the same timestamp. This is very common when working with real data. To resolve it, we add an extra column (here called "Order") that helps the process mining algorithm understand the "normal" order of activities sharing a timestamp, according to the nature of the underlying business. In this case we know that registration happens before diagnosis, so we assign a low value (e.g., 1) to all "Patient Registration" activities. The table might look like this:
|Patient-ID | Activity | timestamp |Order |
|:----------|:--------------------:|:-------------------:| ----:|
| 111 |"Patient Registration"| 2021.06.01 14:33:49 | 1 |
| 111 |"Diagnose finished" | 2021.06.21 18:03:19 | 2 |
| 112 |"Patient Registration"| 2021.06.18 10:03:21 | 1 |
| 112 |"Diagnose finished" | 2021.07.02 01:22:00 | 2 |
| 113 |"Patient Registration"| 2021.07.01 01:20:00 | 1 |
| 113 |"Diagnose finished" | 2021.07.01 01:20:00 | 2 |
| ... | | | |
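The steps above can be sketched in SQL. Here is a minimal, self-contained example using Python's built-in sqlite3 (the table and column names are illustrative, not from any real HIS schema): one mini table per activity, then a UNION with the activity name and tie-breaking order column attached.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One mini table per activity, as described above.
cur.execute("CREATE TABLE registration (patient_id TEXT, ts TEXT)")
cur.execute("CREATE TABLE diagnosis (patient_id TEXT, ts TEXT)")
cur.executemany("INSERT INTO registration VALUES (?, ?)", [
    ("111", "2021-06-01 14:33:49"),
    ("112", "2021-06-18 10:03:21"),
    ("113", "2021-07-01 01:20:00"),
])
cur.executemany("INSERT INTO diagnosis VALUES (?, ?)", [
    ("111", "2021-06-21 18:03:19"),
    ("112", "2021-07-02 01:22:00"),
    ("113", "2021-07-01 01:20:00"),  # same timestamp as registration for 113
])

# UNION the mini tables, attach the activity name and the "Order" column,
# then sort by case, timestamp, and the tie-breaker.
event_log = cur.execute("""
    SELECT patient_id, 'Patient Registration' AS activity, ts, 1 AS ord
      FROM registration
    UNION ALL
    SELECT patient_id, 'Diagnose finished', ts, 2
      FROM diagnosis
    ORDER BY patient_id, ts, ord
""").fetchall()

for row in event_log:
    print(row)
```

For patient 113, the tie-breaking "ord" column keeps "Patient Registration" before "Diagnose finished" even though their timestamps are identical.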
Now you have an event log that process mining algorithms understand!
Side note:
There have been many attempts to automate the event log extraction process. The works of Eduardo González López de Murillas are really interesting if you want to follow this topic. I can also recommend this open-access paper by Eduardo et al. (2018):
"Connecting databases with process mining: a meta model and toolset" (https://link.springer.com/article/10.1007/s10270-018-0664-7)
I analyzed a project with SonarQube 6.3, and it gave me the error:
32 more lines of code need to be covered by tests to reach the minimum
threshold of 65.0% lines coverage
It's related to the rule:
Lines should have sufficient coverage by tests
I would like to know whether this rule covers all types of tests that I run, or only a specific kind, or whether it means that SonarQube could not reach those lines to analyze them.
The reason I am asking is that I don't have any tests at all, so this message would imply that SonarQube recognized tests covering the other lines, which is not the case. How could that happen?
Starting from 6.2, SonarQube is able to recognize "executable lines" in code files whether or not any tests touch those files. The feature must also be supported and fed by your analyzer. I'm guessing you're using an analyzer version that does provide that data, and that's where the calculation of missing coverage on these files untouched by unit tests comes from.
Note that before this functionality was added, the situation looked something like this:
+--------------+-----------+-------+
| File         | Cvd lines | Cvg % |
+--------------+-----------+-------+
| 100LineFile  |        75 |    75 |
+--------------+-----------+-------+
| Total        |        75 |    75 |
+--------------+-----------+-------+
AND
+--------------+-----------+-------+
| File         | Cvd lines | Cvg % |
+--------------+-----------+-------+
| 100LineFile  |        75 |    75 |
| 100LineFile2 |         0 |     - |
+--------------+-----------+-------+
| Total        |        75 |    75 |
+--------------+-----------+-------+
Because files that weren't touched by any unit tests were simply omitted from the calculations, giving a falsely rosy picture of overall coverage. Now it looks like this:
+--------------+-----------+-------+
| File         | Cvd lines | Cvg % |
+--------------+-----------+-------+
| 100LineFile  |        75 |    75 |
| 100LineFile2 |         0 |     0 |
+--------------+-----------+-------+
| Total        |        75 |  37.5 |
+--------------+-----------+-------+
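The arithmetic behind the two totals can be sketched in a few lines of Python (the numbers are taken from the example tables above; the helper function is my own, not SonarQube's):

```python
def overall_coverage(files):
    """files: list of (covered_lines, executable_lines) per file.
    Returns overall coverage as a percentage."""
    covered = sum(c for c, _ in files)
    executable = sum(e for _, e in files)
    return 100.0 * covered / executable

# Old behaviour: the untested file is simply omitted from the calculation.
print(overall_coverage([(75, 100)]))            # 75.0

# New behaviour: its executable lines still count, with zero of them covered.
print(overall_coverage([(75, 100), (0, 100)]))  # 37.5
```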
What is the difference between operational and config data in a YANG model? Is it correct to support the GET, PUT, POST and DELETE interfaces for both operational and config data?
Config represents configuration data, usually what is writable via the northbound agents (CLI, NETCONF, web, etc.); it is also what is retrieved by a NETCONF get-config operation.
Operational data is status data, which is not writable via the northbound agents; it comes from a data provider application.
A web client should only be able to do a GET operation on operational data, because it doesn't make sense to allow a client to change status information.
For config data it makes sense to have all the operations.
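A minimal, illustrative YANG fragment showing the distinction (the module and leaf names are made up for this sketch):

```
module example-if {
  namespace "urn:example:if";
  prefix exif;

  container interface {
    // Config data: writable via GET/PUT/POST/DELETE.
    // ("config true" is the default and is inherited.)
    leaf name {
      type string;
    }
    container stats {
      config false;  // Operational/state data: read-only, GET only.
      leaf rx-packets {
        type uint64;
      }
    }
  }
}
```

Anything under a `config false;` statement is state data; everything else defaults to configuration data.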
NETCONF separates configuration and state (or operational) data:
The information that can be retrieved from a running system is separated into two classes, configuration data and state data. Configuration data is the set of writable data that is required to transform a system from its initial default state into its current state. State data is the additional data on a system that is not configuration data such as read-only status information and collected statistics.
RESTCONF works like NETCONF but over HTTP: it maps CRUD verbs onto NETCONF operations:
+----------+-------------------------------------------------------+
| RESTCONF | NETCONF |
+----------+-------------------------------------------------------+
| OPTIONS | none |
| | |
| HEAD | <get-config>, <get> |
| | |
| GET | <get-config>, <get> |
| | |
| POST | <edit-config> (nc:operation="create") |
| | |
| POST | invoke an RPC operation |
| | |
| PUT | <copy-config> (PUT on datastore) |
| | |
| PUT | <edit-config> (nc:operation="create/replace") |
| | |
| PATCH | <edit-config> (nc:operation depends on PATCH content) |
| | |
| DELETE | <edit-config> (nc:operation="delete") |
+----------+-------------------------------------------------------+
On supporting GET, PUT, POST and DELETE: if you are referring to HTTP methods here, you should probably follow RESTCONF.
If I take a snapshot of a persistent disk and then try to get information about the snapshot in gcutil, the data is always incomplete. I need to see this data since snapshots are differential:
server$ gcutil getsnapshot snapshot-3
+----------------------+-----------------------------------+
| name | snapshot-3 |
| description | |
| creation-time | 2014-07-30T06:52:56.223-07:00 |
| status | READY |
| disk-size-gb | 200 |
| storage-bytes | |
| storage-bytes-status | |
| source-disk | us-central1-a/disks/app-db-1-data |
+----------------------+-----------------------------------+
Is there a way to determine how much space this snapshot is actually occupying? gcutil and the web UI are the only tools I know of, and neither displays this information.
Unfortunately it's a bug, known to the Google developers. They are working on it.