PowerShell - Substring and pivot of information - text files

Disclaimer - complete amateur at powershell. I'm basically throwing myself into the deep end to try and learn, so please forgive my ignorance.
Here is my challenge - I have information in txt file format as such (EMC Storage array layouts):
Tier Name: Extreme Performance
Raid Type: r_5
User Capacity (GBs): 4382.54
Consumed Capacity (GBs): 3923.65
Available Capacity (GBs): 458.89
Percent Subscribed: 89.53%
Data Targeted for Higher Tier (GBs): 0.00
Data Targeted for Lower Tier (GBs): 8.02
What I want to offer the storage team is a report on where their goodies are at. So looking to end up with this:
Tier name            Raid Type  UCap     ConCap   AvCap   %Sub   ...etc
Extreme Performance  r_5        4382.54  3923.65  458.89  89.53  ...
So it's a combo of grabbing everything before the ":" as the heading, grabbing everything after the ":" up to the CRLF as the data, and pivoting that into a table.
It gets better: there are more blocks of data like this in the txt file, separated by a CRLF but using the same labels. So I need to grab only the data from those and append it to the table above.
I've gathered that perhaps I need to use Get-Content and then manipulate the string with -replace, but it seems like Get-Content reads that first entry (Tier Name:) as a drive. Whoops.
Again, I'm keen to learn and welcome pointers. Been going at it for a few hours now...
Thank you very much.

What I understand is that you have many blocks like the one you've shown us, and that each block has exactly 8 entries (not more, not fewer). Also, you want to create a CSV file (am I right?). If I'm right, try this code:
$f="$env:TMP\=in.txt"
gc $f | %{
$csv="$env:TMP\=csv.CSV"
clc $csv -ea Ignore
${#items}=8
$o=[pscustomobject]#{}
}{
$splitted=$_-split':'
if($splitted){
$o|Add-Member $splitted[0] $splitted[1]
if(($o.psobject.properties.count|measure -sum).count -eq ${#items}){
$o|epcsv $csv -Append
$o=[pscustomobject]#{}
}
}
}{
ipcsv $csv -Header 'Tier name','Raid Type','UCap','ConCap','AvCap','%Sub','Higher','Lower'|
select -Skip 1|
ft -AutoSize
}
Notice that $f is the file with all the entries, ${#items} is the number of items per block, and $csv is the resulting CSV file; I replace the CSV headers only when displaying the result.
Here's the output for a $f like below:
Tier name                Raid Type  UCap     ConCap   AvCap   %Sub    Higher  Lower
---------                ---------  ----     ------   -----   ----    ------  -----
Extreme Performance1     r_5        4382.54  3923.65  458.89  89.53%  0.00    8.1
Extreme Performance12    r_5        4382.54  3923.65  458.89  89.53%  0.00    8.12
Extreme Performance123   r_5        4382.54  3923.65  458.89  89.53%  0.00    8.123
Extreme Performance1234  r_5        4382.54  3923.65  458.89  89.53%  0.00    8.1234
Here's an example of $f file content:
Tier Name: Extreme Performance1
Raid Type: r_5
User Capacity (GBs): 4382.54
Consumed Capacity (GBs): 3923.65
Available Capacity (GBs): 458.89
Percent Subscribed: 89.53%
Data Targeted for Higher Tier (GBs): 0.00
Data Targeted for Lower Tier (GBs): 8.1
Tier Name: Extreme Performance12
Raid Type: r_5
User Capacity (GBs): 4382.54
Consumed Capacity (GBs): 3923.65
Available Capacity (GBs): 458.89
Percent Subscribed: 89.53%
Data Targeted for Higher Tier (GBs): 0.00
Data Targeted for Lower Tier (GBs): 8.12
Tier Name: Extreme Performance123
Raid Type: r_5
User Capacity (GBs): 4382.54
Consumed Capacity (GBs): 3923.65
Available Capacity (GBs): 458.89
Percent Subscribed: 89.53%
Data Targeted for Higher Tier (GBs): 0.00
Data Targeted for Lower Tier (GBs): 8.123
Tier Name: Extreme Performance1234
Raid Type: r_5
User Capacity (GBs): 4382.54
Consumed Capacity (GBs): 3923.65
Available Capacity (GBs): 458.89
Percent Subscribed: 89.53%
Data Targeted for Higher Tier (GBs): 0.00
Data Targeted for Lower Tier (GBs): 8.1234
The blank lines may or may not exist.
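As an aside, if you'd rather not go through the temporary CSV file, the same grouping can be done in memory; a rough sketch along the same lines (same placeholder input file, and it still assumes exactly 8 "label: value" lines per block):
$f = "$env:TMP\=in.txt"                       # same placeholder input file as above
$itemsPerBlock = 8                            # entries per block
$block = [ordered]@{}

$report = Get-Content $f | ForEach-Object {
    $name, $value = $_ -split ':', 2          # split only on the first colon
    if ($null -ne $value) {
        $block[$name.Trim()] = $value.Trim()
        if ($block.Count -eq $itemsPerBlock) {
            [pscustomobject]$block            # block complete: emit one row
            $block = [ordered]@{}
        }
    }
}
$report | Format-Table -AutoSize              # or pipe $report to Export-Csv instead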

Well, one route you can take if you're not married to the idea of doing a table is to use the nifty ConvertFrom-StringData cmdlet, which will let you quickly import data in a Name = Value format.
To use that, all we have to do is replace the colons with an equals sign, which we can do like this.
#Setup $string
$string ="Tier Name: Extreme Performance
Raid Type: r_5
User Capacity (GBs): 4382.54
Consumed Capacity (GBs): 3923.65
Available Capacity (GBs): 458.89
Percent Subscribed: 89.53%
Data Targeted for Higher Tier (GBs): 0.00
Data Targeted for Lower Tier (GBs): 8.02"
I didn't have a file with this info, so I just loaded it right into memory in the above step. For your purposes, just run $string = Get-Content -Path .\PathToFile\File.txt -Raw to import the content into $string (with -Raw you get one multi-line string and therefore a single hashtable out of the conversion; without it you get an array of lines and one hashtable per line).
Now we can call the universal -Replace operator to replace characters in a string/variable/object, with the following syntax.
$string -replace ":","="
Which gives us this result:
Tier Name= Extreme Performance
Raid Type= r_5
User Capacity (GBs)= 4382.54
Consumed Capacity (GBs)= 3923.65
Available Capacity (GBs)= 458.89
Percent Subscribed= 89.53%
Data Targeted for Higher Tier (GBs)= 0.00
Data Targeted for Lower Tier (GBs)= 8.02
See that? It's in Name = Value format, precisely what we need for the ConvertFrom-StringData cmdlet. We can now just pipe that right into our convert cmdlet to start working with the data!
$string -replace ":","=" | ConvertFrom-StringData | ft -AutoSize
Name                                 Value
----                                 -----
Tier Name                            Extreme Performance
Consumed Capacity (GBs)              3923.65
Data Targeted for Higher Tier (GBs)  0.00
Percent Subscribed                   89.53%
Raid Type                            r_5
Available Capacity (GBs)             458.89
Data Targeted for Lower Tier (GBs)   8.02
User Capacity (GBs)                  4382.54
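If the real file holds several of these blocks, one way (a sketch only - the path is a placeholder and the split pattern assumes every block starts with a "Tier Name:" line) to turn this into the pivoted table from the question is to split the raw text at each new block, convert each chunk as above, and build one object per block:
$raw    = Get-Content -Path .\PathToFile\File.txt -Raw           # placeholder path
$blocks = $raw -split '(?m)^(?=Tier Name:)' | Where-Object { $_.Trim() }

$report = foreach ($block in $blocks) {
    $data = $block -replace ':','=' | ConvertFrom-StringData      # one hashtable per block
    [pscustomobject]@{
        'Tier name' = $data['Tier Name'].Trim()
        'Raid Type' = $data['Raid Type'].Trim()
        UCap        = $data['User Capacity (GBs)'].Trim()
        ConCap      = $data['Consumed Capacity (GBs)'].Trim()
        AvCap       = $data['Available Capacity (GBs)'].Trim()
        '%Sub'      = $data['Percent Subscribed'].Trim()
        Higher      = $data['Data Targeted for Higher Tier (GBs)'].Trim()
        Lower       = $data['Data Targeted for Lower Tier (GBs)'].Trim()
    }
}
$report | Format-Table -AutoSize    # or: $report | Export-Csv .\TierReport.csv -NoTypeInformation
Swap the Format-Table for Export-Csv if the storage team wants a spreadsheet instead of console output.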
I hope this gets you started on your way!

Related

data structure for a pricing engine (Non dynamic)

I want to design a "pricing calculator" for a cloud platform that takes into account different "service Types (Infra or PaaS)", "VM sizes", "storage", "Compute units (a custom metric of cpu cores + RAM for PaaS services)". These options can be added in any order by customer with any combination.
ex:For Infra: small (2 cpu cores + 2 GB RAM), medium (4 cpu cores + 4 GB),
large (6 CPU cores + 8 GB of RAM) and so on...
For Storage: by GB allocated (both pay as u go and pre-paid and post paid)
For PaaS: Compute Units. Ex: Provisioning 10 CU's per hour translates to 2 CPU cores and 4 GB of RAM.
I want a data structure that can represent all the above combinations of variables
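One possible shape, purely as an illustration (this question has no answer attached here, and every class name, option and rate below is invented), is to model each purchasable option, whatever its service type, as a line item carrying its own quantity and unit rate; a quote is then just a list of line items that can be combined in any order. A PowerShell sketch:
class PriceLineItem {
    [string] $ServiceType   # 'Infra', 'PaaS', 'Storage', ...
    [string] $Option        # 'small', 'Compute Unit', 'GB pay-as-you-go', ...
    [double] $Quantity      # hours, GB, compute units, ...
    [double] $UnitRate      # price per unit of Quantity (made-up rates below)
    [double] Total() { return $this.Quantity * $this.UnitRate }
}

$quote = @(
    [PriceLineItem]@{ ServiceType='Infra';   Option='small (2 CPU + 2 GB RAM)';      Quantity=720;  UnitRate=0.02  }
    [PriceLineItem]@{ ServiceType='Storage'; Option='GB allocated, pay-as-you-go';   Quantity=500;  UnitRate=0.001 }
    [PriceLineItem]@{ ServiceType='PaaS';    Option='Compute Unit hours';            Quantity=7200; UnitRate=0.005 }
)
($quote | ForEach-Object { $_.Total() } | Measure-Object -Sum).Sum   # total price of the quote
New service types or metrics then only need new line items (or a lookup table of rates), not a new calculator.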

What is reference when it says L1 Cache Reference or Main Memory Reference

So I am trying to learn the performance metrics of various computer components, like the L1 cache, L2 cache, main memory, Ethernet, disk, etc., as below:
Latency Comparison Numbers
--------------------------
L1 cache **reference**                         0.5 ns
Branch mispredict                                5 ns
L2 cache **reference**                           7 ns                       14x L1 cache
Mutex lock/unlock                               25 ns
Main memory **reference**                      100 ns                       20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy                10,000 ns       10 us
Send 1 KB bytes over 1 Gbps network         10,000 ns       10 us
Read 4 KB randomly from SSD*               150,000 ns      150 us           ~1GB/sec SSD
Read 1 MB sequentially from memory         250,000 ns      250 us
Round trip within same datacenter          500,000 ns      500 us
Read 1 MB sequentially from SSD*         1,000,000 ns    1,000 us    1 ms   ~1GB/sec SSD, 4X memory
Disk seek                               10,000,000 ns   10,000 us   10 ms   20x datacenter roundtrip
Read 1 MB sequentially from 1 Gbps      10,000,000 ns   10,000 us   10 ms   40x memory, 10X SSD
Read 1 MB sequentially from disk        30,000,000 ns   30,000 us   30 ms   120x memory, 30X SSD
Send packet CA->Netherlands->CA        150,000,000 ns  150,000 us  150 ms
I don't think the "reference" mentioned above is about how much data is read, in bits or bytes; it seems to be about accessing one address in cache or memory.
Can someone please explain better what this reference is that happens in 0.5 ns?
This table lists typical numbers for some representative system; the actual values for a real system would hardly be such "smooth" numbers, but rather complicated sums over non-even multiples of CPU and/or bus clock periods.
You could find such a table in a textbook for educational use. This one apparently found its way into a general introduction to system design [1] from some conference presentations that Jeff Dean, Google AI's lead person, held back in 2009 [3][4].
The two presentation PDFs [3][4] do not give an explicit definition of what exactly was meant by "reference" in those tables. Instead, the tables are presented to point out that the ability to do "back-of-the-envelope calculations" is crucial for successful system design.
The term "reference" likely means retrieving a piece of information from the corresponding level of memory if the requested value is maintained there, so that it doesn't have to be reloaded from a slower source:
L1 cache <- L2 cache <- Main memory (RAM) <- Disk (e.g., swap)
The upper levels of that chain (RAM, disk) should only be seen as a very rough sketch, because here you will find lots of sub-levels and variants (type of mass-storage device, internal cache on the disk's chipset, buses/bridges, etc. etc.).
The present numbers appear to be drawn from experience at Google's data centers.
Therefore, let's assume they are based on some high-performance-class hardware which was relevant in 2009 (or earlier).
Today (2020), the numbers should not be taken literally, but as demonstrating the orders of magnitude in the context of the corresponding values for the other levels of data transfer.
The label "branch mispredict" stands for all cases when a fetch operation from the next level is necessary, because a mispredicted branching decision is the most important reason for cases when such a fetch operation is critical w. r. t. latencies.
In other cases, branch prediction infrastructure is supposed to trigger data fetch operations in time so all latencies beyond the low "reference" value are hidden behind pipeline operations.
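As a quick illustration of the kind of back-of-the-envelope arithmetic these tables are meant to support (the figures below are simply the table values above, nothing measured here):
$l1 = 0.5; $l2 = 7; $mem = 100; $disk = 30e6                                   # ns, straight from the table
"1 main-memory reference   = {0} L1 references" -f ($mem / $l1)                # 200
"1 MB sequential disk read = {0:N0} main-memory references" -f ($disk / $mem)  # 300,000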
[1] The URL you gave us in the comment discussion, "Latency numbers every programmer should know" in "The System Design Primer", references the following sources:
[2] Jeff Dean: "Latency Numbers Every Programmer Should Know", 31 May 2012. "Originally by Peter Norvig ('Teach Yourself Programming in Ten Years') with some updates from Brendan", 1 Jun 2012.
[3] Jeff Dean: "Designs, Lessons and Advice from Building Large Distributed Systems", 13 Oct 2009, page 24.
[4] Jeff Dean: "Software Engineering Advice from Building Large-Scale Distributed Systems", 17 Mar 2009, page 13.
Coming to the specific question about what an L1 cache reference is - it helps to understand multi-level caching -- https://en.wikipedia.org/wiki/CPU_cache#MULTILEVEL
While creating any cache there is a trade-off between hit rate and latency. Larger caches generally have a higher hit rate but also a longer latency. To achieve the best of both worlds, many architectures implement two or more levels of cache: an L1 which is small and super fast, backed by an L2 which is looked up on an L1 miss, the L2 being larger but also slower, and so on. The numbers posted in your reference are a rough ballpark for an L1 hit, it would appear.
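As a toy illustration of that trade-off - the latencies are taken from the table above, while the 0.95 / 0.90 hit rates are invented purely for the example - the expected access time is the L1 latency plus the miss penalties weighted by their miss rates:
$l1 = 0.5; $l2 = 7; $mem = 100          # ns, from the table above
$h1 = 0.95; $h2 = 0.90                  # assumed L1 and L2 hit rates (made-up figures)
$avg = $l1 + (1 - $h1) * ($l2 + (1 - $h2) * $mem)
"Average access time: {0:N2} ns" -f $avg   # ~1.35 ns - far closer to L1 than to memory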

Latency of accessing main memory is almost the same order as sending a packet

Looking at Jeff Dean's famous latency guides
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                             0.5 ns
Branch mispredict                                5 ns
L2 cache reference                               7 ns                       14x L1 cache
Mutex lock/unlock                               25 ns
Main memory reference                          100 ns                       20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy                 3,000 ns        3 us
Send 1K bytes over 1 Gbps network           10,000 ns       10 us
Read 4K randomly from SSD*                 150,000 ns      150 us           ~1GB/sec SSD
Read 1 MB sequentially from memory         250,000 ns      250 us
Round trip within same datacenter          500,000 ns      500 us
Read 1 MB sequentially from SSD*         1,000,000 ns    1,000 us    1 ms   ~1GB/sec SSD, 4X memory
Disk seek                               10,000,000 ns   10,000 us   10 ms   20x datacenter roundtrip
Read 1 MB sequentially from disk        20,000,000 ns   20,000 us   20 ms   80x memory, 20X SSD
Send packet CA->Netherlands->CA        150,000,000 ns  150,000 us  150 ms
One thing which looks somewhat uncanny to me is that reading 1 MB sequentially from disk is only about 10 times faster than sending a round-trip packet across the Atlantic. Can anyone give me more intuition for why this feels right?
Q : 1MB SEQ-HDD-READ ~ 10x faster than a CA/NL trans-atlantic RTT - why this feels right?
Some "old" values ( with a few cross-QPI/NUMA updates from 2017 ) to start from:
        0.5 ns - CPU L1 dCACHE reference
          1 ns - speed-of-light (a photon) travel a 1 ft (30.5 cm) distance
          5 ns - CPU L1 iCACHE Branch mispredict
          7 ns - CPU L2  CACHE reference
         71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
        100 ns - MUTEX lock/unlock
        100 ns - CPU own DDR MEMORY reference
        135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
        202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
        325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
     10,000 ns - Compress 1 KB with Zippy PROCESS (+GHz, +SIMD, +multicore tricks)
     20,000 ns - Send 2 KB over 1 Gbps NETWORK
    250,000 ns - Read 1 MB sequentially from MEMORY
    500,000 ns - Round trip within a same DataCenter
 10,000,000 ns - DISK seek
 10,000,000 ns - Read 1 MB sequentially from NETWORK
 30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|
Trans-Atlantic Network RTT:
Global optical networks work roughly at the speed of light (300,000,000 m/s).
An LA(CA)-AMS(NL) packet has to travel not the geodesic "distance", but over a set of continental and trans-Atlantic "submarine" cables, the length of which is way longer (see the map).
These factors do not "improve" - only the transport capacity grows, while the add-on latencies introduced by light amplifiers, re-timing units and other L1-PHY / L2-/L3-networking technologies are kept under control, as small as possible.
So the LA(CA)-AMS(NL) RTT will remain, using this technology, about the same ~150 ms.
Using other technology, LEO-sat constellations for example, the "distance" will only grow from the ~9,000 km point-to-point path, by a pair of additional GND/LEO segments plus a few additional LEO/LEO hops, which introduce a "longer" distance and add-on hop-by-hop re-processing latencies; and the capacity will not get anywhere close to the optical transports available today, so no magic jump "back to the future" is to be expected (we still miss the DeLorean).
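A rough sanity check of that ~150 ms figure (the ~12,000 km one-way cable path and the ~200,000 km/s speed of light in fibre are assumptions, not measurements):
$cableOneWayKm   = 12000       # assumed cable path, well above the ~9,000 km geodesic
$lightInFibreKms = 200000      # ~2/3 of c, typical for optical fibre
$rttMs = 2 * $cableOneWayKm / $lightInFibreKms * 1000
"Fibre propagation alone: ~{0:N0} ms round trip" -f $rttMs   # ~120 ms, before any equipment delays
Add routing, amplification/re-timing and endpoint processing on top of that propagation floor and ~150 ms is about what is left to expect.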
The HDD Disk:
HDDs can have a very fast and very short transport path for moving the data, but the READ ops have to wait for the physical/mechanical operations of the media-reading heads (that is what takes most of the time here, not the actual data transfer to the host RAM).
HDDs are rotational devices; the disk has to "align" where to start the read, which costs about the first 10 ms.
HDD devices store data in a static structure of heads (2+, reading physical signals from the magnetic platters' surfaces) : cylinders (concentric circular zones on the platter, into which a cylinder-aligned reading head gets settled by the disk-head micro-controller) : sectors (angular sections of the cylinder, each carrying a block of data of the same size, ~4 KB, 8 KB, ...).
These factors do not "improve" either - all commodity-produced drives remain at industry-selected angular speeds of about { 5k4 | 7k2 | 10k | 15k | 18k } spins/min (RPM). This means that, if a well-compacted data layout is maintained on such a disk, one continuous head:cylinder-aligned read around the whole cylinder will take:
>>> [ 1E3 / ( RPM / 60. ) for RPM in ( 5400, 7200, 10000, 15000, 18000 ) ]
11.1 ms per CYL # 5k4 RPM disk,
8.3 ms per CYL # 7k2 RPM disk,
6.0 ms per CYL # 10k RPM disk,
4.0 ms per CYL # 15k RPM disk,
3.3 ms per CYL # 18k RPM disk.
Data density is also limited by the magnetic media properties. Spintronics R&D will bring some more densely stored data, yet the last 30 years have stayed well inside the limits of reliable magnetic storage.
More is to be expected from a trick to read from several heads in parallel at once, yet this goes against the design of the embedded micro-controllers, so most of the reading still happens sequentially, from one head after another, into the HDD controller's onboard buffers, best if no cyl-to-cyl mechanical re-alignment of the heads has to take place (technically this depends on the prior data-to-disk layout, maintained by the O/S and the possible care of disk optimisers - originally called disk "compression" tools - which just tried to re-align the known sequences of FAT-described data blocks so as to follow the most optimal trajectory of head:cyl:sector transitions, depending mostly on the actual device's head:head and cyl:cyl latencies). So even the most optimistic data layout takes ~13..21 [ms] to seek-and-read just one head:cyl path.
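Putting assumed round figures together for a single 7k2 RPM drive (the average seek and the sustained transfer rate below are guesses, chosen only to land in the same ballpark as the 20-30 ms figures in the tables above):
$seekMs      = 10                          # average seek, as in the tables
$rpm         = 7200
$halfSpinMs  = 0.5 * 1e3 / ($rpm / 60)     # average rotational latency, ~4.2 ms
$transferMBs = 100                         # assumed sustained media transfer rate, MB/s
$seekMs + $halfSpinMs + 1000 / $transferMBs   # ~24 ms to position and then read 1 MB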
Laws of Physics decide
Some numbers from 2020:
A load from L1 is 4 cycles on Intel Coffee Lake and Ryzen (0.8 ns on a 5 GHz CPU).
A load from memory is ~215 cycles on Intel Coffee Lake (43 ns on a 5 GHz CPU), and ~280 cycles on Ryzen.
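(Converting cycles to time is just cycles divided by the clock rate in GHz, which yields nanoseconds, e.g. for the 5 GHz figures above:)
4 / 5      # 4 cycles   at 5 GHz = 0.8 ns
215 / 5    # 215 cycles at 5 GHz = 43  ns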

AWS EC2: Baseline of 3 IOPS per GiB with a minimum of 100 IOPS

I seem to remember the policy was a baseline of 3 IOPS per GiB. If I have a volume of 8 GB, I get 24 IOPS. Now with a minimum of 100 IOPS, do I get at least 100 IOPS no matter how small my volume is?
Yes, at 33.33 GiB and below, an EBS SSD (gp2) volume will have 100 IOPS. This is spelled out clearly in the docs.
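A minimal sketch of that arithmetic (the function name is made up, and it ignores gp2's upper IOPS cap and burst credits):
function Get-Gp2BaselineIops([double] $SizeGiB) {   # hypothetical helper name
    [math]::Max(100, 3 * $SizeGiB)                  # 3 IOPS/GiB with a 100 IOPS floor
}
Get-Gp2BaselineIops 8        # 100  (3 * 8 = 24 would be below the floor)
Get-Gp2BaselineIops 33.34    # ~100 (the crossover point)
Get-Gp2BaselineIops 500      # 1500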

Is there a way, using wmic, to reverse engineer which volume maps to which partition(s)? [closed]

Problem.. I only have access to wmic... Lame, I know.. but I need to figure out which volume corresponds to which partition(s), and which partition(s) correspond to which disk. I know how to tell which partition corresponds to which disk, because the disk ID is directly in the results of the wmic query. However, the first part of the problem is more difficult: how do I correlate which volume belongs to which partitions?
Is there a way, using wmic, to reverse engineer which volume maps to which partition(s)?
If so, how would this query look?
wmic logicaldisk get name, volumename
for more info use wmic logicaldisk get /?
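If you also need to tie those logical disks back to partitions and physical disks using wmic alone, the WMI association classes may help; note this only covers volumes that actually have drive letters (Win32_LogicalDisk), so the letterless and RAW volumes discussed further down are not captured:
wmic path Win32_LogicalDiskToPartition get Antecedent,Dependent
wmic partition get DeviceID,DiskIndex,Index,Size
The first command should print pairs along the lines of Win32_DiskPartition.DeviceID="Disk #0, Partition #1" against Win32_LogicalDisk.DeviceID="C:", and the second shows which disk index each partition belongs to.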
The easiest way to do this is with diskpart from a command prompt:
C:\>diskpart
Microsoft DiskPart version 10.0.10586
Copyright (C) 1999-2013 Microsoft Corporation.
On computer: TIMSPC
DISKPART> select disk 0
Disk 0 is now the selected disk.
DISKPART> detail disk
HGST HTS725050A7E630    (Note: this is the model of my hard disk)
Disk ID: 00C942C7
Type : SATA
Status : Online
Path : 0
Target : 0
LUN ID : 0
Location Path : PCIROOT(0)#PCI(1F02)#ATA(C00T00L00)
Current Read-only State : No
Read-only : No
Boot Disk : Yes
Pagefile Disk : Yes
Hibernation File Disk : No
Crashdump Disk : Yes
Clustered Disk : No
  Volume ###  Ltr  Label        Fs     Type        Size     Status   Info
  ----------  ---  -----------  -----  ----------  -------  -------  --------
  Volume 0         System       NTFS   Partition    350 MB  Healthy  System
  Volume 1     C   OSDisk       NTFS   Partition    464 GB  Healthy  Boot
  Volume 2                      NTFS   Partition    843 MB  Healthy  Hidden
DISKPART> exit
Leaving DiskPart...
C:\>
You have access to a command line since you have access to WMIC, so this method should work.
Based on the comments below:
No, there is no way to use WMIC to determine with 100% accuracy exactly which volume corresponds to which partition on a specific drive. The problem with determining this information via WMI is that not all drives are basic drives. Some disks may be dynamic disks containing a RAID volume that spans multiple drives. Some may be a complete hardware-implemented abstraction, like a storage array (for example, a p410i RAID controller in an HP ProLiant). In addition, there are multiple partitioning schemes (e.g., UEFI/GPT vs BIOS/MBR). WMI, however, is independent of its environment. That is, it doesn't care about the hardware. It is simply another form of abstraction that provides a common interface model that unifies and extends existing instrumentation and management standards.
Getting the level of detail you desire will require a tool that can interface at a much lower level, like the driver for the device, and hoping that the driver provides the information you need. If it doesn't, you will be looking at very low-level programming to interface with the device itself... essentially creating a new driver that provides the information you want. But based on your limitation of only having command-line access, DiskPart is the closest prebuilt tool you will find.
There are volumes which do not have traditional letters.
And? Diskpart can select disk, partitions, and volumes based on the number assigned. The drive letter is irrelevant.
At no point in disk part is any kind of ID listed which allows a user to 100% know which partition they are dealing with when they reference a volume.
Here is an example from one of my servers with two 500 GB hard drives. The first is the boot/OS drive. The second has 2 GB of unallocated space.
DISKPART> list volume
  Volume ###  Ltr  Label        Fs     Type        Size     Status   Info
  ----------  ---  -----------  -----  ----------  -------  -------  ------
  Volume 0         System       NTFS   Partition    350 MB  Healthy  System
  Volume 1     C   OSDisk       NTFS   Partition    465 GB  Healthy  Boot
  Volume 2     D   New Volume   NTFS   Partition    463 GB  Healthy
DISKPART> select volume 2
Volume 2 is the selected volume.
DISKPART> list disk
  Disk ###  Status  Size    Free     Dyn  Gpt
  --------  ------  ------  -------  ---  ---
  Disk 0    Online  465 GB     0 B
* Disk 1    Online  465 GB  2049 MB
DISKPART> list partition
  Partition ###  Type     Size    Offset
  -------------  -------  ------  -------
* Partition 1    Primary  463 GB  1024 KB
DISKPART> list volume
  Volume ###  Ltr  Label        Fs     Type        Size     Status   Info
  ----------  ---  -----------  -----  ----------  -------  -------  ------
  Volume 0         System       NTFS   Partition    350 MB  Healthy  System
  Volume 1     C   OSDisk       NTFS   Partition    465 GB  Healthy  Boot
* Volume 2     D   New Volume   NTFS   Partition    463 GB  Healthy
DISKPART>
Notice the asterisks? Those denote the currently selected disk, partition, and volume. While these are not the ID you require to let a user know with 100% certainty which partition they are dealing with, you can at least clearly see that Volume 2 (D:) is on Partition 1 of Disk 1.
There are volumes that are RAW disks which is essentially saying.. this is a raw disk and I want to find out where these raw disks are at.
As you can see, after I have created a volume with no file system on the 2 GB of free space, this does not make any difference:
DISKPART> list volume
  Volume ###  Ltr  Label        Fs     Type        Size     Status   Info
  ----------  ---  -----------  -----  ----------  -------  -------  -------
  Volume 0         System       NTFS   Partition    350 MB  Healthy  System
  Volume 1     C   OSDisk       NTFS   Partition    465 GB  Healthy  Boot
  Volume 2     D   New Volume   NTFS   Partition    463 GB  Healthy
  Volume 3                      RAW    Partition   2048 MB  Healthy
DISKPART> select volume 3
Volume 3 is the selected volume.
DISKPART> list volume
  Volume ###  Ltr  Label        Fs     Type        Size     Status   Info
  ----------  ---  -----------  -----  ----------  -------  -------  -------
  Volume 0         System       NTFS   Partition    350 MB  Healthy  System
  Volume 1     C   OSDisk       NTFS   Partition    465 GB  Healthy  Boot
  Volume 2     D   New Volume   NTFS   Partition    463 GB  Healthy
* Volume 3                      RAW    Partition   2048 MB  Healthy
DISKPART> list partition
  Partition ###  Type     Size     Offset
  -------------  -------  -------  -------
  Partition 1    Primary   463 GB  1024 KB
* Partition 2    Primary  2048 MB   463 GB
DISKPART> list disk
  Disk ###  Status  Size    Free     Dyn  Gpt
  --------  ------  ------  -------  ---  ---
  Disk 0    Online  465 GB     0 B
* Disk 1    Online  465 GB  1024 KB
The reason that I am using wmic is that I need to script out many disk ops. Have you ever tried to script out getting information from diskpart?
No, but it is scriptable.
In your sample data, you can enumerate the disks, volumes, and partitions. By looping through each object and selecting it, you can create a map of which volume is on which partition and which drive contains that partition. DiskPart may not provide 100% of the data you need 100% of the time with 100% of the accuracy you want, but it is the closest command-line tool you are going to find to meet your goal.
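For completeness, a rough sketch of what that scripting could look like (the temp-file name is a placeholder and the prompt has to be elevated): DiskPart accepts a script file via its /s switch, so you can generate the commands, run them, and parse the text that comes back - here selecting one volume and pulling out the asterisk-flagged disk and partition lines, exactly as in the listings above:
$script = Join-Path $env:TEMP 'dp_commands.txt'      # placeholder temp-file name
@'
select volume 2
list disk
list partition
'@ | Set-Content $script -Encoding ASCII

$output = diskpart /s $script                        # run DiskPart non-interactively
$output | Where-Object { $_ -match '^\s*\*' }        # keep the lines marked with the selection asterisk
Loop that over each volume number from "list volume" and you can assemble the volume -> partition -> disk map as plain text.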
