lightest way to join 2 CSV (like an SQL LEFT JOIN) - windows

I'm reling on community expertise to guide me in the best way about following topic.
In a professional context runing on windows without possibility to install MS-Office application I need to distribut to my team a way to join 2 CSV files and produce a 3rd CSV files as output. Exactly as if we run a SQL query like :
SELECT f1.*, f1.bar = f2.bar as baz
FROM CSVfile1 as f1
LEFT JOIN CSVfile2 as f2
ON f1.key = f2.key
Aims is currently reached with Excel + VBA but MS-office package will be removed and no more accessibly. A solution with MS-Acces is not envisageable because of the same reason.
The goal is to allow any body to actualise the 3rd CSV without any competence and specific installation on it Computer. So an approach with python or MS-SQL-Servr is not good also.
I was thinking to accomplish that with Powshell script but first. I'm not habit to use PowerShell but I can learn.
But before trying that I ask to the community if this is the best way ? Or if there is better solution ? (requirements: Windows OS (latest version), No MS-office, No specific install).
Thank you all.

PowerShell has no built-in join functionality (akin to SQL's[1]) as of v7.2, though adding a Join-Object cmdlet is being proposed in GitHub issue #14994; third-party solutions are available, via the PowerShell Gallery (e.g., JoinModule).
For now, if installing third-party tools isn't an option, you can roll your own solution with the following approach, which usesImport-Csv to load the CSV files, an auxiliary hashtable to find corresponding rows, and Add-Member to add columns (properties).
# Create sample CSV files.
$csv2 = #'
key,bar,quux
key1,bar1,quux1
key2,bar2,quux2
key3,bar3,quux3
'# > ./CSVFile1.csv
#'
key,bar
key1,bar1
key2,bar2a
'# > ./CSVFile2.csv
# Import the the 2nd file and load its rows
# (as objects with properties reflecting the columns)
# into a hashtable, keyed by the column 'key' values.
$hash = #{}
foreach ($row in Import-Csv ./CSVFile2.csv) {
$hash[$row.key] = $row
}
# Import the 1st file and process each row (object):
# Look for a matching object from the 2nd file and add
# a calculated column derived from both objects to the
# input object.
Import-Csv ./CSVFile1.csv | ForEach-Object {
$matching = $hash[$_.key]
$_ |
Add-Member -PassThru baz $(if ($matching) { [int] ($matching.bar -eq $_.bar) })
}
Pipe the last statement to Export-Csv to export the resulting objects to a CSV file.
(E.g.
... | Export-Csv -NoTypeInformation -Encoding utf8 Results.csv)
The above yields the following:
key bar quux baz
--- --- ---- ---
key1 bar1 quux1 1
key2 bar2 quux2 0
key3 bar3 quux3
[1] There is a -join operator, but its purpose is to join the elements of a single array to form a single string.

Here's an off the wall answer using the sqlite command-line shell (a single 900kb executable) and the same sql join command. https://sqlite.org/download.html Sqlite seems to have trouble with utf16 or "unicode" text files. Even Excel has more trouble importing a utf16 csv.
# making csv file with ">" (utf16) caused this error:
# CREATE TABLE csvfile1(...) failed: duplicate column name:
'key,bar,quux
key1,bar1,quux1
key2,bar2,quux2
key3,bar3,quux3' | set-content csvfile1.csv
'key,bar
key1,bar1
key2,bar2a' | set-content csvfile2.csv
'.mode csv
.import csvfile1.csv csvfile1
.import csvfile2.csv csvfile2
.headers on
SELECT f1.*, f1.bar = f2.bar as baz
FROM CSVfile1 as f1
LEFT JOIN CSVfile2 as f2
ON f1.key = f2.key' | .\sqlite3
# output
# key,bar,quux,baz
# key1,bar1,quux1,1
# key2,bar2,quux2,0
# key3,bar3,quux3,

Related

Trying to get all paths in a YAML file

I've got an input YAML file (test.yml) as follows:
# sample set of lines
foo:
x: 12
y: hello world
ip_range['initial']: 1.2.3.4
ip_range[]: tba
array['first']: Cluster1
array2[]: bar
The source contains square brackets for some keys (possibly empty).
I'm trying to get a line by line list of all the paths in the file, ideally like:
foo.x: 12
foo.y: hello world
foo.ip_range['initial']: 1.2.3.4
foo.ip_range[]: tba
foo.array['first']: Cluster1
array2[]: bar
I've used the yamlpaths library and the yaml-paths CLI, but can't get the desired output. Trying this:
yaml-paths -m -s =foo -K test.yml
outputs:
foo.x
foo.y
foo.ip_range\[\'initial\'\]
foo.ip_range\[\]
foo.array\[\'first\'\]
Each path is on one line, but the output has all the escape characters ( \ ). Modifying the call to remove the -m option ("expand matching parent nodes") fixes that problem but the output is then not one path per line:
yaml-paths -s =foo -K test.yml
gives:
foo: {"x": 12, "y": "hello world", "ip_range['initial']": "1.2.3.4", "ip_range[]": "tba", "array['first']": "Cluster1"}
Any ideas how I can get the one line per path entry but without the escape chars? I was wondering if there is anything for path querying in the ruamel modules?
Your "paths" are nothing more than the joined string representation of the keys (and probably indices) of the
mappings (and potentially sequences) in your YAML document.
That can be trivially generated from data loaded from YAML with a recursive function:
import sys
import ruamel.yaml
yaml_str = """\
# sample set of lines
foo:
x: 12
y: hello world
ip_range['initial']: 1.2.3.4
ip_range[]: tba
array['first']: Cluster1
array2[]: bar
"""
def pathify(d, p=None, paths=None, joinchar='.'):
if p is None:
paths = {}
pathify(d, "", paths, joinchar=joinchar)
return paths
pn = p
if p != "":
pn += '.'
if isinstance(d, dict):
for k in d:
v = d[k]
pathify(v, pn + k, paths, joinchar=joinchar)
elif isinstance(d, list):
for idx, e in enumerate(d):
pathify(e, pn + str(idx), paths, joinchar=joinchar)
else:
paths[p] = d
yaml = ruamel.yaml.YAML(typ='safe')
paths = pathify(yaml.load(yaml_str))
for p, v in paths.items():
print(f'{p} -> {v}')
which gives:
foo.x -> 12
foo.y -> hello world
foo.ip_range['initial'] -> 1.2.3.4
foo.ip_range[] -> tba
foo.array['first'] -> Cluster1
array2[] -> bar
While Anthon's answer certainly produces the output you were after, I think your question was specifically about how to get the yaml-paths command to produce the desired output. I'll address that original question.
As of version 3.5.0, the yamlpath project's yaml-paths command supports a --noescape option which removes the escape symbols from output. Using your input file and the new option, you may find this output more to your liking:
$ yaml-paths --nofile --expand --keynames --noescape --values --search='=~/.*/' test.yml
foo.x: 12
foo.y: hello world
foo.ip_range['initial']: 1.2.3.4
foo.ip_range[]: tba
foo.array['first']: Cluster1
array2[]: bar
Note:
Using the --values option includes the value with each YAML Path.
For interest, I changed the --search expression to match every node in the input file rather than only the "foo" data.
The default output (without setting --noescape) produces YAML Paths which can be used as direct input into other YAML Path parsers and processors; setting --noescape changes this to render human-friendly paths which may not work as downstream YAML Path input.
Disclaimer: I am the author of the yamlpath project. Should you ever run into issues or have questions about it, please visit the project's GitHub project site and engage me via Issues (bugs and feature requests) or Discussions (questions). Thank you!

TextFSM Template for Netmiko for "inc" phrase

I am trying to create a textfsm template with the Netmiko library. While it works for most of the commands, it does not work when I try performing "inc" operation in the network device. The textfsm index file seems like it is not recognizing the same command for 2 different templates; for instance:
If I am giving the command - show running | inc syscontact
And give another command - show running | inc syslocation
in textfsm index; the textfsm template seems like it is recognizing only the first command; and not the second command.
I understand that I can get the necessary data by the regex expression for syscontact and syslocation for the commands( via the template ), however I want to achieve this by the "inc" command from the device itself. Is there a way this can be done?
you need to escape the pipe in the index file. e.g. sh[[ow]] ru[[nning]] \| inc syslocation
There is a different way to parse that you want all datas which is called TTP module. You can take the code I wrote below as an example. You can create your own templates.
from pprint import pprint
from ttp import ttp
import json
import time
with open("showSystemInformation.txt") as f:
data_to_parse = f.read()
ttp_template = """
<group name="Show_System_Information">
System Name : {{System_Name}}
System Type : {{System_Type}} {{System_Type_2}}
System Version : {{Version}}
System Up Time : {{System_Uptime_Days}} days, {{System_Uptime_HR_MIN_SEC}} (hr:min:sec)
Last Saved Config : {{Last_Saved_Config}}
Time Last Saved : {{Last_Time_Saved_Date}} {{Last_Time_Saved_HR_MIN_SEC}}
Time Last Modified : {{Last_Time_Modified_Date}} {{Last_Time_Modifed_HR_MIN_SEC}}
</group>
"""
parser = ttp(data=data_to_parse, template=ttp_template)
parser.parse()
# print result in JSON format
results = parser.result(format='json')[0]
print(results)
Example run:
[appadmin#ryugbz01 Nokia]$ python3 showSystemInformation.py
[
{
"Show_System_Information": {
"Last_Saved_Config": "cf3:\\config.cfg",
"Last_Time_Modifed_HR_MIN_SEC": "11:46:57",
"Last_Time_Modified_Date": "2022/02/09",
"Last_Time_Saved_Date": "2022/02/07",
"Last_Time_Saved_HR_MIN_SEC": "15:55:39",
"System_Name": "SR7-2",
"System_Type": "7750",
"System_Type_2": "SR-7",
"System_Uptime_Days": "17",
"System_Uptime_HR_MIN_SEC": "05:24:44.72",
"Version": "C-16.0.R9"
}
}
]

Snakemake, how to change output filename when using wildcards

I think I have a simple problem but I don't how to solve it.
My input folder contains files like this:
AAAAA_S1_R1_001.fastq
AAAAA_S1_R2_001.fastq
BBBBB_S2_R1_001.fastq
BBBBB_S2_R2_001.fastq
My snakemake code:
import glob
samples = [os.path.basename(x) for x in sorted(glob.glob("input/*.fastq"))]
name = []
for x in samples:
if "_R1_" in x:
name.append(x.split("_R1_")[0])
NAME = name
rule all:
input:
expand("output/{sp}_mapped.bam", sp=NAME),
rule bwa:
input:
R1 = "input/{sample}_R1_001.fastq",
R2 = "input/{sample}_R2_001.fastq"
output:
mapped = "output/{sample}_mapped.bam"
params:
ref = "refs/AF086833.fa"
run:
shell("bwa mem {params.ref} {input.R1} {input.R2} | samtools sort > {output.mapped}")
The output file names are:
AAAAA_S1_mapped.bam
BBBBB_S2_mapped.bam
I want the output file to be:
AAAAA_mapped.bam
BBBBB_mapped.bam
How can I or change the outputname or rename the files before or after the bwa rule.
Try this:
import pathlib
indir = pathlib.Path("input")
paths = indir.glob("*_S?_R?_001.fastq")
samples = set([x.stem.split("_")[0] for x in paths])
rule all:
input:
expand("output/{sample}_mapped.bam", sample=samples)
def find_fastqs(wildcards):
fastqs = [str(x) for x in indir.glob(f"{wildcards.sample}_*.fastq")]
return sorted(fastqs)
rule bwa:
input:
fastqs = find_fastqs
output:
mapped = "output/{sample}_mapped.bam"
params:
ref = "refs/AF086833.fa"
shell:
"bwa mem {params.ref} {input.fastqs} | samtools sort > {output.mapped}"
Uses an input function to find the correct samples for rule bwa. There might be a more elegant solution, but I can't see it right now. I think this should work, though.
(Edited to reflect OP's edit.)
Unfortunately, I've also had this problem with filenames with the following logic: {batch}/{seq_run}_{index}_{flowcell}_{lane}_{read_orientation}.fastq.gz.
I think that the core problem is that none of the individual wildcards are unique. Also, not all values for all wildcards can be combined; seq_run1 was run on lane1, not lane2. Therefore, expand() does not work.
After multiple attempts in Snakemake (see below), my solution was to standardize input with mv / sed / rename. Removing {batch}, {flowcell} and {lane} made it possible to use {sample}, a unique combination of {seq_run} and {index}.
What did not work (but it could be worth to try for others in the same situation):
Adding the zip argument to expand()
Renaming output using the following syntax:
output: "_".join(re.split("[/_]", "{full_filename}")[1,2]+".fastq.gz"

Prettify YAML with comments

1. Summary
I can't find, how I can automatically prettify my YAML files.
2. Data
Example:
    I have SashaPrettifyYAML.yaml file:
sasha_commands:
# Sasha comment
sasha_command_help: {call: sublime.command_help, caption: 'Sasha Command: Command Help'}
3. Expected behavior
I want to delete {braces}:
sasha_commands:
# Sasha comment
sasha_command_help:
call: sublime.command_help
caption: 'Sasha Command: Command Help'
4. Not helped
Pretty YAML (based on PyYAML) and online formatters as YAML Formatter and OnlineYAMLTools delete comments;
I can't find the required option in ruamel.yaml.cmd;
align-yaml align, not prettify YAML file.
There is no option to do this in ruamel.yaml.cmd, but it is fairly straightforward to do this with a small python program and using ruamel.yaml, by loading and dumping in round-trip mode (the default).
The only thing you need to do is make sure the flow-style on the data-structure that is the value for the key sasha_command_help is set to block-style (which is how I interpret your definition of "prettifying YAML"):
import sys
import ruamel.yaml
yaml_str = """\
sasha_commands:
# Sasha comment
sasha_command_help: {call: sublime.command_help, caption: 'Sasha Command: Command Help'}
"""
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load(yaml_str)
data['sasha_commands']['sasha_command_help'].fa.set_block_style()
yaml.dump(data, sys.stdout)
this will exactly give the output you expect.
A recursive data structure walker can be found in scalarstring.py in the ruamel.yaml source, and adapted to make a generic "make-everything-block-style" routine:
import sys
import ruamel.yaml
def block_style(base):
"""
This routine walks over a simple, i.e. consisting of dicts, lists and
primitives, tree loaded from YAML. It recurses into dict values and list
items, and sets block-style on these.
"""
if isinstance(base, dict):
for k in base:
try:
base.fa.set_block_style()
except AttributeError:
pass
block_style(base[k])
elif isinstance(base, list):
for elem in base:
try:
base.fa.set_block_style()
except AttributeError:
pass
block_style(elem)
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
file_in = sys.argv[1]
file_out = sys.argv[2]
with open(file_in) as fp:
data = yaml.load(fp)
block_style(data)
with open(file_out, 'w') as fp:
yaml.dump(data, fp)
If you store the above in prettifyyaml.py you can call it with:
python prettifyyaml.py SashaPrettifyYAML.yaml Prettified.yaml
Since you are already using single quotes around the scalar that has embedded spaces, you won't see a change if you leave out yaml.preserve_quotes = True. But if you had used a double quoted scalar then that line makes sure the double quotes are preserved.
I had the same problem. I wrote my own YAML beautifier https://github.com/wangkuiyi/yamlfmt. I hope it helps.
I tried top results from Google, but none of them address the requirements of https://sqlflow.org/sqlflow, which I am leading:
https://pypi.org/project/yamlfmt cannot handle a file of multiple YAML documents separated by ---
https://github.com/devopyio/yamlfmt cannot handle multiple files.
https://github.com/miekg/yamlfmt/blob/master/fmt.go cannot replace (inline edit) the input files.
You can use yq tool - it's easy to install and use, and it's well maintained.
Supposing you have example.yml file to format, it can be processed by following ways:
from file: yq r --unwrapScalar -p pv -P example.yml '*'
from stdin: cat example.yml | yq r --unwrapScalar -p pv -P - '*'

1003 error (unable to find an operator for alias ) in group function in pig

I have written a .pig file whose content is :
register /home/tuhin/Documents/PigWork/pigdata/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define csvloader org.apache.pig.piggybank.storage.CSVLoader();
xyz = load '/pigdata/salaryTravelReport.csv' using csvloader();
x = foreach xyz generate $0 as name:chararray, $1 as title:chararray, replace($2, ',','') as salary:bytearray, replace($3, ',', '') as travel:bytearray, $4 as orgtype:chararray, $5 as org:chararray, $6 as year:bytearray;
refined = foreach x generate name, title, (float)salary, (float)travel, orgtype, org, (int)year;
year2010 = filter refined by year == 2010;
byjobtitile = GROUP year2010 by title;
The purpose is to remove ',' in dollar value in 2 columns and then group the data by jobtitle. When I am running this using run command there is not error. Even dumping of year2010 is working fine. But dumping byjobtitiel is giving error:
error in dumping
The output of the log file is:
Pig Stack Trace
--------------- ERROR 1003: Unable to find an operator for alias byjobtitle
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1003: Unable
to find an operator for alias byjobtitle at
org.apache.pig.PigServer$Graph.buildPlan(PigServer.java:1544) at
org.apache.pig.PigServer.storeEx(PigServer.java:1029) at
org.apache.pig.PigServer.store(PigServer.java:997) at
org.apache.pig.PigServer.openIterator(PigServer.java:910) at
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66) at
org.apache.pig.Main.run(Main.java:565) at
org.apache.pig.Main.main(Main.java:177) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at
org.apache.hadoop.util.RunJar.run(RunJar.java:221) at
org.apache.hadoop.util.RunJar.main(RunJar.java:136)
I am new to bigdata and dont have much knowledge. But it looks like there is a problem in data type. Can anyone help me out?
The issue is due to "CSVLoader" you are using. This will have ',' as default delimiter. Since your data also has "," in some of its field like salary and travel, the positional index is getting changed. So if your data is something like this
name title salary travel orgtype org year
A B 10,000 23,1357 ORG_TYPE ORG 2016
then using CSVLoader will make "A B 10" as the first field, "000 23" as the second field and "1357 ORG_TYPE ORG 2016" as the third field based on ","
register /Users/rakesh/Documents/SVN/iReporter/iReporterJobFramework/avro/lib/1.7.5/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define csvloader org.apache.pig.piggybank.storage.CSVLoader();
xyz = load '<path to your file>' using csvloader();
a = foreach xyz generate $0;
2016-06-07 12:28:12,384 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1<br>
(A B 10)<br>
You can make your delimiter different so that it is not present in any field value.
Try using CSVExcelStorage. You can use its constructor to explicitly define the delimiter
register /Users/rakesh/Documents/SVN/iReporter/iReporterJobFramework/avro/lib/1.7.5/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage('|','NO_MULTILINE','NOCHANGE');
It will work fine as long as same identifier is not present as ;
delimiter
any field value

Resources