Convert file from Windows to UNIX through PowerShell or Batch

I have a batch script that prompts a user for some input and then outputs a couple of files I'm using in an AIX environment. These files need to be in UNIX format (which I believe means UTF-8), and I'm looking for some direction on the SIMPLEST way of doing this.
I'd rather not have to download extra software packages such as Cygwin or GnuWin32. I don't mind coding this if it's possible; my coding options are Batch, PowerShell and VBS. Does anyone know of a way to do this?
Alternatively, could I create the files with Batch and call a PowerShell script to reformat them?
The idea here is that a user is prompted for some information, and I then output a standard file whose contents are basically the prompt answers for a job in AIX. I used Batch initially because I didn't know I would run into this problem, but I'm leaning towards redoing this in PowerShell, because I found some code on another forum that can do the conversion (below).
foreach ($i in Get-ChildItem -Name DIR/*.txt) {
    Get-Content DIR/$i |
        Out-File -Encoding utf8 -FilePath DIR2/$i
}
Looking for some direction or some input on this.

You can't do this without external tools in batch files.
If all you need is the file encoding, then the snippet you gave should work. If you want to convert the files in place (instead of writing them to another location), you can do
Get-ChildItem *.txt | ForEach-Object { (Get-Content $_) | Out-File -Encoding UTF8 $_ }
(The parentheses around Get-Content are important.) However, this will write the files in UTF-8 with a signature at the start (the byte-order mark, U+FEFF) which some Unix tools don't accept (even though it's technically legal, if discouraged).
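If you need to check whether a file already carries that signature, here is a minimal sketch (the path is a placeholder) that inspects the first three bytes:
$bytes = [IO.File]::ReadAllBytes('C:\data\file.txt')
if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
    'UTF-8 signature (BOM) present'
}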
Then there is the problem that line breaks differ between Windows and Unix: Unix uses only U+000A (LF), while Windows uses two characters, U+000D U+000A (CR+LF). So ideally you'd convert the line breaks too, but that gets a little more complex:
Get-ChildItem *.txt | ForEach-Object {
    # get the contents and replace CR+LF (or a lone CR) by U+000A
    $contents = [IO.File]::ReadAllText($_.FullName) -replace "`r`n?", "`n"
    # create UTF-8 encoding without signature
    $utf8 = New-Object System.Text.UTF8Encoding $false
    # write the text back (FullName avoids .NET's different notion of the current directory)
    [IO.File]::WriteAllText($_.FullName, $contents, $utf8)
}

Try the overload ReadAllText(String, Encoding) if your files contain ANSI characters and not only ASCII ones.
$contents = [IO.File]::ReadAllText($_, [Text.Encoding]::Default) -replace "`r`n", "`n"
https://msdn.microsoft.com/en-us/library/system.io.file.readalltext(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/system.text.encoding(v=vs.110).aspx
ASCII - Gets an encoding for the ASCII (7-bit) character set.
Default - Gets an encoding for the operating system's current ANSI code page.
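As a quick illustration of the difference (hypothetical file and content; assume input.txt contains "año" saved in the system's ANSI code page, e.g. Windows-1252):
# Default decodes the ANSI code page, so ñ survives; ASCII replaces non-ASCII bytes with '?'
$ansi  = [IO.File]::ReadAllText('C:\data\input.txt', [Text.Encoding]::Default)  # "año"
$ascii = [IO.File]::ReadAllText('C:\data\input.txt', [Text.Encoding]::ASCII)    # "a?o"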

Related

Convert multiple CSV files to UTF-8 encoding using a script/Windows Command prompt

I am trying to create a script that converts the encoding of a collection of CSV files (10-20 files) in a directory to UTF-8. Currently, I am doing this manually by opening each individual file in Notepad++, switching the encoding to UTF-8, and re-saving.
Are there any Windows commands or something else (I have Cygwin installed as well) that I could use to build a script to do this? Ideally, the script would loop through every CSV file in the directory and convert it to UTF-8.
Thank you in advance for the help!!!
You're not specifying what to convert from, but assuming the input encoding is Windows-1252, try
for file in *.csv; do
  iconv -f CP1252 -t UTF-8 <"$file" >"$file.tmp" &&
  mv "$file.tmp" "$file"
done
This could leave some files unconverted (for example, if an input file contains bytes which are undefined in the source encoding), but in that case it will not overwrite the source file. (Maybe disable the mv step until you can see whether everything converts without errors.)
You can easily do that in PowerShell
Get-Content filename.csv | Set-Content -Encoding utf8 filename-utf8.csv
For your loop, you need to modularize your commands so that you can reference and call them properly. In your case, you need to take the BaseName and append ".csv" to it. After that, simply using the right variables in the right places in the ForEach loop will make it work.
$a = Get-ChildItem *.csv
ForEach ($item in $a) {
    Get-Content $item.FullName | Set-Content -Encoding utf8 "$($item.BaseName).csv.utf8"
}
Remember that before PowerShell 6, the UTF8 encoding used here includes a BOM (byte-order mark): three bytes placed at the beginning of the file by the conversion.
Also, the conversion needs to write to an additional file, which you can then move over the original.
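A sketch of that two-step pattern in PowerShell (file names are placeholders; Move-Item plays the role of mv):
Get-ChildItem *.csv | ForEach-Object {
    $tmp = "$($_.FullName).tmp"
    Get-Content $_.FullName | Set-Content -Encoding UTF8 $tmp   # convert into a temp file
    Move-Item -Force $tmp $_.FullName                            # then replace the original
}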

Is there a PowerShell Get-Content Function to extract line based on the first character?

I am trying to extract each line from a CSV that has over 1 million (1,000,000) lines, where the first character is a 1.
The 1 in this case refers to the first line of a log. There are several different logs in this file, and I need the first line of each of them. The problem is (as you can imagine) that 1 is not unique and can appear in any of the 12 'columns' of data I have in this CSV.
Essentially, I would like to extract them all to a new CSV file as well, for further breakdown.
I know it sounds simple enough, but I cannot seem to get the information I need.
I have searched StackOverflow, Microsoft, Google and my own Tech Team.
My attempt in PowerShell:
Get-Content 'C:\Users\myfiles\Desktop\massivelogs.csv' | Select-String "1" | Out-File "extractedlogs.csv"
The immediate answer is that you must use Select-String '^1' in order to restrict matching to the start (^) of each input line.
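With that anchor in place, the question's own pipeline becomes (a minimal sketch using the paths from the question):
Get-Content 'C:\Users\myfiles\Desktop\massivelogs.csv' |
    Select-String '^1' |
    Out-File 'extractedlogs.csv'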
However, a much faster solution is to use the switch statement with the -File option:
$inFile = 'C:\Users\myfiles\Desktop\massivelogs.csv'
$outFile = 'extractedlogs.csv'
& { switch -File $inFile -Wildcard { '1*' { $_ } } } | Set-Content $outFile
Note, however, that the output file won't be a true CSV file, because it will lack a header row.
Also, note that Set-Content applies an edition-specific default character encoding (the active ANSI code page in Windows PowerShell, BOM-less UTF-8 in PowerShell Core); use -Encoding as needed.
Using -Wildcard with a wildcard pattern (1*) speeds things up slightly, compared to -Regex with ^1.
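If you do want a header row in the output, one option (a sketch; it assumes the header line itself does not start with 1) is to emit the input's first line before the switch:
$inFile  = 'C:\Users\myfiles\Desktop\massivelogs.csv'
$outFile = 'extractedlogs.csv'
& {
    Get-Content $inFile -TotalCount 1                  # copy the header row first
    switch -File $inFile -Wildcard { '1*' { $_ } }     # then every line starting with 1
} | Set-Content $outFile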

Powershell equivalent of simple Perl regex search replace one liner to find replace in UCS-2LE or UTF-16 Little Endian file

This question is related to another one which went the Perl way but ran into many difficulties due to Windows bugs (see Perl or Powershell how to convert from UCS-2 little endian to utf-8 or do inline oneliner search replace regex on UCS-2 file).
I would like the POWERSHELL equivalent of a simple Perl regex on a little-endian UCS-2 format file (UCS-2LE is the same as UTF-16 Little Endian), i.e.:
perl -pi.bak -e 's/search/replace/g;' MyUCS-2LEfile.txt
You will probably need to tell PowerShell (Get-Content) that the input file is UCS-2LE, and that you want the output file in the same UCS-2LE (Windows CR LF) format as well.
This will output the file after applying the regex. The output file does -not- begin with a BOM. This should work for small files; for large files, it may require changes to be speedy.
$fin = 'C:/src/t/revbom-in.txt'
$fout = 'C:/src/t/revbom-out.txt'
if (Test-Path -Path $fout) { Remove-Item -Path $fout }
# Create a file for input
$UCS2LENoBomEncoding = New-Object System.Text.UnicodeEncoding $False, $False
[System.IO.File]::WriteAllLines($fin, "now is the time`r`nwhen was the time", $UCS2LENoBomEncoding)
# Read the file in, replace string, write file out
[System.IO.File]::ReadLines($fin, $UCS2LENoBomEncoding) |
    ForEach-Object {
        [System.IO.File]::AppendAllLines($fout, [string[]]($_ -replace 'the','a'), $UCS2LENoBomEncoding)
    }
HT: #refactorsaurusrex at https://gist.github.com/refactorsaurusrex/9aa6b72f3519dbc71f7d0497df00eeb1 for the [string[]] cast
NB: mklement0 at https://gist.github.com/mklement0/acb868a9f15d9a34b6e88fc874b3851d
NB: If the source file is HTML, please see https://stackoverflow.com/a/1732454/447901
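For small files, a simpler cmdlet-only alternative is possible; note, however, that unlike the .NET approach above, Set-Content -Encoding Unicode writes a UTF-16LE BOM at the start of the output:
(Get-Content -Encoding Unicode $fin) -replace 'the', 'a' |
    Set-Content -Encoding Unicode $fout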

Converting unix script to windows script - emulating a Sed command in PowerShell

I have a Unix script (Korn shell, to be exact) that is working well, and I need to convert it to a Windows batch script. So far I have tried inserting a PowerShell command line into my code, but it doesn't work. Please help; I am new to both Unix scripting and Windows scripting, so any help will do.
This is the line of code that I need to convert:
#create new file to parse ; exclude past instances of timestamp
parsefile=/tmp/$$.parse
sed -e "1,/$TIMESTAMP/d" -e "/$TIMESTAMP/d" $DSTLOGFILE > $parsefile
So far I have tried calling a PowerShell command line from my script, but it didn't work:
:set_parse_file
#powershell -Command "Get-Content $SCHLOGFILE | Foreach-Object {$_ -replace('1,/"$TIMESTAMP"/d' '/"$TIMESTAMP"/d'} | Set-Content $PARSEFILE"
Any suggestions please?
PowerShell has no sed-like constructs for processing ranges of lines (e.g., sed interprets 1,/foo/ as referring to the range of consecutive lines from line 1 through a subsequent line that matches the regex foo).
Emulating this feature with line-by-line processing would be much more verbose; a comparatively concise version is possible if the input file is processed as a whole, which is only an option for files small enough to fit into memory (PSv5+ syntax).
Here's the pure PowerShell code:
$escapedTimeStamp = [regex]::Escape($TIMESTAMP)
(Get-Content -Raw $SCHLOGFILE) -replace ('(?ms)\A.*?\r?\n.*?' + $escapedTimeStamp + '.*?\r?\n') `
-replace ('(?m)^.*?' + $escapedTimeStamp + '.*\r?\n') |
Set-Content -NoNewline $PARSEFILE
Note that [regex]::Escape() is used to make sure that the value of $TIMESTAMP is treated as a literal, even if it happens to contain regex metacharacters (chars. with special meaning to the regex engine).
Your ksh code doesn't do that (and it's nontrivial to do in ksh), so if, conversely, $TIMESTAMP should be interpreted as a regex, simply omit that step and use $TIMESTAMP directly.
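For instance, with an illustrative value (not from the question):
[regex]::Escape('12:34:56.789')   # -> 12:34:56\.789  (the regex metacharacter '.' is escaped)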
The -replace operator is regex-based and uses the .NET regular-expression engine.
It is the use of Get-Content's -Raw switch that requires PSv3+ and the use of Set-Content's -NoNewline switch that requires PSv5+. You can make this command work in earlier versions, but it requires more effort.
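For instance, on PSv2 you could fall back on .NET for both the whole-file read and the newline-free write (a sketch under that assumption):
$escapedTimeStamp = [regex]::Escape($TIMESTAMP)
$text = [IO.File]::ReadAllText($SCHLOGFILE)    # PSv2 substitute for Get-Content -Raw
$text = $text -replace ('(?ms)\A.*?\r?\n.*?' + $escapedTimeStamp + '.*?\r?\n') `
              -replace ('(?m)^.*?' + $escapedTimeStamp + '.*\r?\n')
[IO.File]::WriteAllText($PARSEFILE, $text)     # PSv2 substitute for Set-Content -NoNewline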
Calling the PSv5+ command above from cmd.exe (a batch file) gets quite unwieldy - and you always have to be wary of quoting issues - but it should work:
#powershell.exe -noprofile -command "$escapedTimeStamp = [regex]::Escape('%TIMESTAMP%'); (Get-Content -Raw '%SCHLOGFILE%') -replace ('(?ms)\A.*?\r?\n.*?' + $escapedTimeStamp + '.*?\r?\n') -replace ('(?m)^.*?' + $escapedTimeStamp + '.*\r?\n') | Set-Content -NoNewline '%PARSEFILE%'"
Note how the -command argument is passed as a single "..." string, which is ultimately the safest and conceptually cleanest way to pass code to PowerShell.
Also note the need to embed batch variables as %varname% in the command, and since they are enclosed in embedded '...' above, the assumption is that their values contain no ' chars.
Therefore, consider implementing your entire script in PowerShell - you'll have a much more powerful scripting language at your disposal, and you'll avoid the quoting headaches that come from bridging two disparate worlds.

PowerShell script to get UTF8?

We have a PowerShell script which creates a user in Microsoft Exchange and Active Directory.
We get the user's data from a preformatted txt file which serves as a sort of CSV:
$data = import-csv signup.txt
But the problem is that, as we are from Spain, the character ñ sometimes appears; it isn't picked up by the script and generates a bad username and bad data. So we put an N instead, and then go into Exchange and change it from there again.
How can I fix that problem?
I recommend converting the file to UTF-8, because the Import-Csv cmdlet works with it.
I usually create an empty file in Notepad++ with UTF-8 encoding and copy the text from the other file into it.
Or, as stated here, convert it on the fly; read the input with the system's ANSI code page (reading it as ASCII would turn ñ into ?):
Get-Content signup.txt -Encoding Default | Out-File signup_utf8.txt -Encoding UTF8
Import-Csv signup_utf8.txt
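Alternatively, on PowerShell 3 or later you can skip the intermediate file and let Import-Csv read the ANSI-encoded input directly:
$data = Import-Csv signup.txt -Encoding Default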
