I have been trying to parse the data from a text file that is generated by Teradata fast export utility.
The data looks like this:
Type2LRF|84|249
Job3|86|327
StageTOStageBackUp|85|327
When I counted the garbage characters at the start of each line, there were 2 of them.
I have been trying to parse the text file to remove the first 2 characters and generate a new text file out of it.
The new file should look like this:
Type2LRF|84|249
Job3|86|327
StageTOStageBackUp|85|327
I am trying to add the first 2 characters but they are not appearing correctly in the above block.
The Teradata fast export code that I am using is:
.LOGTABLE Informatica_Test.JobControlExport_log;
.LOGON server_name/dbc,dbc;
DATABASE Informatica_Test;
.BEGIN EXPORT SESSIONS 2;
.EXPORT OUTFILE "data.txt"
MODE RECORD FORMAT TEXT;
SELECT ((TRIM((COALESCE(J.JobName,''))))
||'|'||
(TRIM((COALESCE(JC.JobControlID,''))))
||'|'||
(TRIM((COALESCE(JC.Success_Source_Rows,''))))
)(TITLE '') from
Informatica_Test.JobControl JC
JOIN Informatica_Test.Job J
ON J.JobID = JC.JobID
JOIN Informatica_Test.BatchControl BC
ON BC.BatchControlID = JC.BatchCtrlID
where BC.BatchID = 1 -- This will be a parameter
and BC.EndDatetime = (select max(EndDatetime) from Informatica_Test.BatchControl);
.END EXPORT;
.LOGOFF;
@echo off
setlocal enabledelayedexpansion
break>test.txt
for /F "tokens=*" %%A in (data.txt) do (
set line=%%A
echo !line:~2! >>test.txt
)
I have tried the above code for removing the 2 characters.
Your exported data is VARCHAR so the first two bytes are the binary length of the string.
Instead of parsing/fixing the FastExport output file, use a different tool to export the data.
For larger numbers of rows, use Teradata Parallel Transporter (TPT) to export as delimited text (without the need for explicit concatenation or changing the file afterwards).
For small numbers of rows, use BTEQ EXPORT with REPORT format.
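If the FastExport file has to be repaired anyway, the length prefix can be used to read each record instead of blindly dropping two characters per line (a length of 10 is the byte 0x0A, which a line-based loop would mistake for a newline). A minimal Python sketch, assuming each record is a 2-byte little-endian length, then the payload, then "\n" or "\r\n". The function name is hypothetical and the byte order depends on the client platform, so treat both as assumptions:

```python
import struct


def strip_length_prefixes(raw: bytes, byteorder: str = "<") -> list[bytes]:
    """Split a buffer of [2-byte length][payload][newline] records into payloads.

    byteorder is a struct prefix: "<" little-endian (assumed), ">" big-endian.
    """
    records = []
    pos = 0
    while pos + 2 <= len(raw):
        # Read the binary VARCHAR length that precedes the text.
        (length,) = struct.unpack_from(byteorder + "H", raw, pos)
        pos += 2
        records.append(raw[pos:pos + length])
        pos += length
        # Skip the end-of-record marker (assumed "\r\n" or "\n").
        if raw[pos:pos + 2] == b"\r\n":
            pos += 2
        elif raw[pos:pos + 1] == b"\n":
            pos += 1
    return records
```

Reading the file with open("data.txt", "rb").read() and writing the returned payloads joined by newlines would give the cleaned file.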
I have so much trouble with leading zeros in general. Importing into Sheets using JDBC connection, I haven't figured out a way to keep the zeros. The column types are varchar() for values of varied length, and char() for static length.
In the past with other data I have added a leading ' to values, or chosen to getDisplayValue() to keep them. What would work here?
while (results.next()){
var tmpArr = [];
var rowString = '';
for (var col = 0; col < numCols; col++) {
rowString += results.getString(col + 1) + '\t';
tmpArr.push(results.getString(col + 1));
}
valArr.push(tmpArr);
}
sheet.getRange(3, 1 , valArr.length, numCols).setValues(valArr);
Data example (varchar column):
0110205361
0201206352
140875852
LFCP01367
LGLM00017
You are retrieving data into a Google Sheet from a MySQL database table using Jdbc. One of the database columns is formatted as "varchar" and includes some all-numeric values that have one or more leading zeros. When you update the database values to your Google Sheet, the leading zeros are not displayed.
Why
The all-numeric values are displayed without their leading zeros because the cells are formatted as Number, Automatic (or otherwise as a number). Google Sheets therefore 'interprets' them as numbers and, by default, drops all leading zeros.
On the other hand, if the cells are formatted as Number, Plain Text, then the all-numeric values are 'interpreted' as strings, and any leading zeros are retained.
The effect of formatting can be clearly seen in the following images, which also include istext and isnumber formula to confirm how they are interpreted under each format type.
Formatted as Number - Plain Text - treated as strings
Formatted as Number - Automatic - treated as numbers
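The difference between the two formats can be mimicked outside Sheets. This is only a rough Python model of the behaviour described above, not how Sheets actually parses cells, and the function name is made up for illustration:

```python
def as_cell_would_display(value: str, plain_text: bool) -> str:
    """Roughly mimic how a cell shows `value` under a text or a number format."""
    if plain_text or not value.isdigit():
        # Plain Text format (or a non-numeric value): kept verbatim as a string.
        return value
    # Number format: the value is parsed as a number, so leading zeros vanish.
    return str(int(value))
```

Under this model only the all-numeric values change, which matches the sample data: values such as LFCP01367 are unaffected either way.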
Formatting on the fly
An alternative to pre-formatting (which wasn't successful in the OP's case) is to set the format as part of the same call chain as setValues(), using setNumberFormat.
For example:
sheet.getRange(3, 1 , valArr.length, numCols).setNumberFormat('@STRING@').setValues(valArr);
There is a useful discussion of this method in Format a Google Sheets cell in plaintext via Apps Script
I have a CSV like the one below. "India,Inc" is a company name, a single value that contains a comma.
How do I get the values using LINQ?
12321,32432,423423,Kevin O'Brien,"India,Inc",234235,23523452,235235
Assuming that you will always have the columns that you specify and that the only variable is that company name can have commas inside, this UGLY code can help you achieve your goal.
var file = File.ReadLines("test.csv");
var value = from p in file
select new string[]
{ p.Split(',')[0],
p.Split(',')[1],
p.Split(',')[2],
p.Split(',')[3],
p.Split(',').Count() == 7 ? p.Split(',')[4] :
(p.Split(',').Count() > 7 ? String.Join(",",p.Split(',').Skip(4).Take(p.Split(',').Count() - 7).ToArray() ) : ""),
p.Split(',')[p.Split(',').Count() - 3],
p.Split(',')[p.Split(',').Count() - 2],
p.Split(',')[p.Split(',').Count() - 1]
};
A regular expression would work, bit nasty due to the recursive nature but it does achieve your goal.
List<string> matches = new List<string>();
string subjectString = "12321,32432,423423,Kevin O'Brien,\"India,Inc\",234235,23523452,235235";
Regex regexObj = new Regex(@"(?<="")\b[123456789a-z,']+\b(?="")|[123456789a-z']+", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success)
{
matches.Add(matchResults.Value);
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
This should suffice in most cases. It handles quoted strings, strings with double quotes within them, and embedded commas.
var subjectString = "12321,32432,423423,Kevin O'Brien,\"India,Inc\",234235,\"Test End\"\"\",\"\"\"Test Start\",\"Test\"\"Middle\",23523452,235235";
var result=Regex.Split(subjectString,@",(?=(?:[^""]*""[^""]*"")*[^""]*$)")
.Select(x=>x.StartsWith("\"") && x.EndsWith("\"")?x.Substring(1,x.Length-2):x)
.Select(x=>x.Replace("\"\"","\""));
It does however break, if you have a field with a single double quote inside it, and the string itself is not enclosed in double quotes -- this is invalid in most definitions of a CSV file, where any field that contains CR, LF, Comma, or Double quote must be enclosed in double quotes.
You should be able to reuse the same Regex expression to break on lines as well for small CSV files. Larger ones you would want a better implementation. Replace the double quotes with LF, and remove the matching ones (unquoted LF's). Then use the regular expression again replacing the quotes with CR, and split on matching.
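The quoting rules mentioned above (a field containing a comma or a double quote is enclosed in double quotes, and an inner double quote is doubled) are what standard CSV parsers implement. As a point of comparison with the regex, here is a small sketch using Python's built-in csv module; the helper name is made up for illustration:

```python
import csv
import io


def parse_csv_line(line: str) -> list[str]:
    """Parse one CSV record, honouring quoted fields and doubled quotes."""
    return next(csv.reader(io.StringIO(line)))


sample = '12321,32432,423423,Kevin O\'Brien,"India,Inc",234235,23523452,235235'
fields = parse_csv_line(sample)
# fields[4] is the company name with its embedded comma kept intact.
print(fields[4])
```

The same parser also unescapes doubled quotes, so a field written as "Test""Middle" comes back as Test"Middle.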
Another option is to use CSVHelper and not try to reinvent the wheel.
var csv = new CsvHelper.CsvReader(new StreamReader("test.csv"));
while (csv.Read())
{
Console.WriteLine(csv.GetField<int>(0));
Console.WriteLine(csv.GetField<string>(1));
Console.WriteLine(csv.GetField<string>(2));
Console.WriteLine(csv.GetField<string>(3));
Console.WriteLine(csv.GetField<string>(4));
}
Guide
I would recommend LINQ to CSV, because it is powerful enough to handle special characters including commas, quotes, and decimals. They have really worked a lot of these issues out for you.
It only takes a few minutes to set up and it is really worth the time because you won't run into these types of issues down the road like you would with custom code. Here are the basic steps, but definitely follow the instructions in the link above.
Install the Nuget package
Create a class to represent a line item (name the fields the way they're named in the csv)
Use CsvContext.Read() to read into an IEnumerable which you can easily manipulate with LINQ
Use CsvContext.Write() to write a List or IEnumerable to a CSV
This is very easy to setup, has very little code, and is much more scalable than doing it yourself.
Because you're only reading values delimited by commas, the spaces shouldn't cause an issue if you just treat them like any other character.
var values = File.ReadLines(path)
    .SelectMany(line => line.Split(','));
I mount an SMB path using this code:
urlStringOfVolumeToMount = [urlStringOfVolumeToMount stringByAddingPercentEscapesUsingEncoding:NSMacOSRomanStringEncoding];
NSURL *urlOfVolumeToMount = [NSURL URLWithString:urlStringOfVolumeToMount];
FSVolumeRefNum returnRefNum;
FSMountServerVolumeSync( (CFURLRef)urlOfVolumeToMount, NULL, NULL, NULL, &returnRefNum, 0L);
Then I get the contents of some paths:
NSMutableArray *content = (NSMutableArray *)[[NSFileManager defaultManager] contentsOfDirectoryAtPath:path error:&error];
My problem is every path in "content" array containing special chars (ü for example) give me 2 chars encoded : ü becomes u¨
when i log bytes using :
[contentItem dataUsingEncoding:NSUTF8StringEncoding];
it gives me : 75cc88 which is u (75) and ¨(cc88)
What i expected is the ü char encoded in utf-8. In bytes, it should be c3bc
I've tried converting my path using ISOLatin1 encoding, MacOSRoman... but as long as the content path already has 2 separate chars instead of one for ü, any conversion gives me 2 encoded chars.
If someone can help, thanks
My configuration : localized in french and using snow leopard.
urlStringOfVolumeToMount = [urlStringOfVolumeToMount stringByAddingPercentEscapesUsingEncoding:NSMacOSRomanStringEncoding];
Unless you specifically need MacRoman for some reason, you should probably be using UTF-8 here.
NSMutableArray *content = (NSMutableArray *)[[NSFileManager defaultManager] contentsOfDirectoryAtPath:path error:&error];
My problem is every path in "content" array containing special chars (ü for example) give me 2 chars encoded : ü becomes u¨
You're expecting composed characters and getting decomposed sequences.
Since you're getting the pathnames from the file-system, this is not a problem: The pathnames are correct as you're receiving them, and as long as you pass them to something that does Unicode right, they will display correctly as well.
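The byte values from the question can be reproduced with Unicode normalization. A short Python sketch (Python is used here only for illustration, the helper names are hypothetical) showing that the decomposed and composed forms are the same text in two normalization forms:

```python
import unicodedata


def to_composed(name: str) -> str:
    """Return the precomposed (NFC) form of a possibly decomposed string."""
    return unicodedata.normalize("NFC", name)


def same_name(a: str, b: str) -> bool:
    """Compare two names after bringing both to the same normalization form."""
    return to_composed(a) == to_composed(b)


decomposed = "u\u0308"  # 'u' followed by COMBINING DIAERESIS, as the file system returns it
print(decomposed.encode("utf-8").hex())               # 75cc88
print(to_composed(decomposed).encode("utf-8").hex())  # c3bc
```

If the composed bytes are really needed (for example to match names from another source), normalizing to the composed form before comparing or encoding is one option.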
Well, four years later I'm struggling with the same thing but for åäö in my case.
Took a lot of time to find the simple solution.
NSString has the necessary comparator built in.
Comparing aString with anotherString, where one comes from the array returned by NSFileManager's contentsOfDirectoryAtPath:, is as simple as:
if( [aString compare:anotherString] == NSOrderedSame )
The compare method takes care of making both strings into a comparable canonical format. In effect: "if they look the same, they are the same".
I've looked at the other ruby/encoding related posts but haven't been able to figure out why the following is not working. Likely just because I'm dense, but here's the situation.
Using Ruby 1.9 on windows. I have a set of CSV files that need some data appended to the end of each line. Whenever I run my script, the appended characters are gibberish. The input text appears to be IBM437 encoding, whereas my string I'm appending starts as US-ASCII. Nothing I've tried with respect to forcing encoding on the input strings or the append string seems to change the resultant output. I'm stumped. The current encoding version is simply the last that I tried.
def append_salesperson(txt, salesperson)
if txt.length > 2
return txt.chomp.force_encoding('US-ASCII') + %(, "", "", "#{salesperson}")
end
end
salespeople = Hash[
"fname", "Record Manager"]
outfile = File.open("ActData.csv", "w:US-ASCII")
salespeople.each do | filename, recordManager |
infile = File.open("#{filename}.txt")
infile.each do |line|
outfile.puts append_salesperson(line, recordManager)
end
infile.close
end
outfile.close
One small note that is related to your question is that you have your csv data as such %(, "", "", "#{salesperson}"). Here you have a space char before your double quotes. This can cause the #{salesperson} to be interpreted as multiple fields if there is a comma in this text. To fix this there can't be white space between the comma and the double quotes. Example: "this is a field","Last, First","and so on". This is one little gotcha that I ran into when creating reports meant to be viewed in Excel.
For reference, RFC 4180, Common Format and MIME Type for Comma-Separated Values (CSV) Files, describes the grammar of a CSV file.
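The effect of the space before the opening quote can be seen with any parser that follows this grammar. A small sketch using Python's csv module rather than Excel (so Excel's exact behaviour is an assumption here, and the helper name is made up):

```python
import csv


def split_fields(line: str, skip_space: bool = False) -> list[str]:
    """Parse one CSV line; skip_space ignores blanks right after a delimiter."""
    return next(csv.reader([line], skipinitialspace=skip_space))


# Space before the quote: the quote is no longer at the start of the field,
# so it is treated as ordinary text and the inner comma splits the value.
print(split_fields('a, "Last, First"'))

# No space: the quoted value stays one field.
print(split_fields('a,"Last, First"'))
```

Some parsers offer an option to tolerate the space (skip_space=True above maps to the csv module's skipinitialspace), but writing the file without the space is the safer fix.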
Maybe txt.chomp.force_encoding('US-ASCII') + %(, "", "", "#{salesperson.force_encoding('something')}")?
It sounds like the CSV data is coming in as UTF-16... hence the puts shows as the printable character (the first byte) plus a space (the second byte).
Have you tried encoding your appended data with .force_encoding(Encoding::UTF_16LE) or .force_encoding(Encoding::UTF_16BE)?
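If the input really is UTF-16 (that is a guess, the question reports IBM437), appending ASCII bytes directly to each line would produce exactly this kind of gibberish, because every UTF-16 character takes two bytes. A sketch in Python rather than Ruby, with a hypothetical helper name, of decoding first, appending as text, and encoding again:

```python
def append_field(raw_line: bytes, extra: str, encoding: str = "utf-16-le") -> bytes:
    """Decode a raw line, append text to it, and re-encode in the same encoding.

    The default encoding is an assumption; pass the file's real encoding.
    """
    text = raw_line.decode(encoding).rstrip("\r\n")
    return (text + extra + "\n").encode(encoding)


# Each ASCII character becomes two bytes in UTF-16LE, the second one being 0x00.
print("AB".encode("utf-16-le"))  # b'A\x00B\x00'
```

The same idea applies whatever the real encoding turns out to be: convert the line and the appended string to one common encoding before concatenating, instead of relabelling the bytes.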