Awk Split String Print Then Print Again


Yes, was going to post the same if someone hadn't already. This is the unix way. Small tools with narrow focus, strung together in pipelines.

>> This is the unix way. Small tools with narrow focus, strung together in pipelines.

> ah yes, awk's famously narrow focus, [...]

"GNU's Not Unix"


And GNU is what most people rush to install on macOS and Windows, which says a lot about the actual needs of people vs. the supposed benefits of Unix purism.

awk is a tool for handling records in text. csv is a textual format for records.

I'd argue that needing a different tool for handling records in text so that you can pass it to a tool for handling records in text is a bit too far.


I think getting awk to recognize that a field separator inside a quoted string should be ignored is a great addition. This is not inconsistent with the "unix way." Many, many unix tools recognize that a quoted string should be treated as a separate entity. The more unix-like approach would have been to force users to remove quotes if they want awk to split strings based on field separators within quotes. In hindsight, I'm surprised the quote respecting option wasn't added a long time ago.
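To make the difference concrete, here is a minimal Python sketch (not from the thread; the sample line is made up) contrasting a naive split on the separator with a quote-aware parse using the standard csv module:

```python
import csv

line = 'id,"Smith, John",42'  # made-up sample record

# Naive splitting treats the comma inside the quotes as a separator.
naive_fields = line.split(",")

# The csv module honours the quotes and keeps the field whole.
quoted_fields = next(csv.reader([line]))

print(naive_fields)
print(quoted_fields)
```

The naive version yields four fields with stray quote characters, while the quote-aware version yields the three fields a reader would expect.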

I have grown fond of using miller[0] to handle command line data processing. Handles the standard tabular formats (csv, tsv, json) and has all of the standard data cleanup options. Works on streams so (most operations) are not limited by memory.

[0]: https://github.com/johnkerl/miller

xsv does similar stuff for CSV, and very rapidly: https://github.com/BurntSushi/xsv

https://miller.readthedocs.io/en/latest/why/ has a nice section on "why miller":

> First: there are tools like xsv which handles CSV marvelously and jq which handles JSON marvelously, and so on -- but I over the years of my career in the software industry I've found myself, and others, doing a lot of ad-hoc things which really were fundamentally the same except for format. So the number one thing about Miller is doing common things while supporting multiple formats: (a) ingest a list of records where a record is a list of key-value pairs (however represented in the input files); (b) transform that stream of records; (c) emit the transformed stream -- either in the same format as input, or in a different format.
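The (a) ingest, (b) transform, (c) emit shape described in that quote can be sketched in a few lines of Python. This is only an illustration of the idea, not how Miller is implemented; the function name `csv_to_jsonl` and the sample data are made up:

```python
import csv
import io
import json

def csv_to_jsonl(text, transform=lambda rec: rec):
    """Ingest CSV as key-value records, transform each one, emit JSON Lines."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return "\n".join(json.dumps(transform(dict(rec))) for rec in reader)

sample = "id,name\n1,Bob\n2,Jane\n"  # made-up input
out = csv_to_jsonl(sample, lambda r: {**r, "name": r["name"].upper()})
print(out)
```

The transform step never sees the input or output format, which is the point: swapping CSV for TSV or JSON only changes the two ends.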


Out of curiosity, I tried these two on that same data & computer as https://news.ycombinator.com/item?id=31356573 , mlr --c2t cat takes 2.96 seconds while xsv cat rows to /dev/null takes 0.434 seconds. So, 14.8X and 2.17X slower than that c2tsv Nim program to do exactly (and only) that conversion. But, yes, yes I am sure perf varies depending on quoting/escaping/column/etc. densities.

For completeness, just one CPU/machine, but a recent checkout of zsv 2tsv (built with -O3 -march=native) on that same file/same computer seems to take 0.380 sec - almost 2X longer than c2tsv's 0.20 sec (built with -mm:arc -d:danger, gcc-11), but zsv 2tsv does seem a little faster than xsv cat rows.

OTOH, zsv count only takes 0.114 sec for me (but of course, as I'm sure you know, that also only counts rows not columns which some might complain about). { EDIT: and I've never tried to time a "parse only" mode for c2tsv. }


BTW does c2tsv handle multibyte UTF8, \r\n vs \n, regular escapes e.g. embedded dbl-quote/nl/lf/comma, as well as CSV garbage that doesn't exist in theory but is abundant in the real world (e.g. dbl-quote inside a cell that did not start w dbl-quote, malformed UTF8, etc)? Handling those in the same way Excel does added considerable overhead to zsv (and is the reason it could only perform a subset of the processing in SIMD and had to use regular branch code for other)

It handles most cases, but perhaps not arbitrary garbage that humans might be able to guess, but I don't think rfc4180 includes all those anyway. c2tsv is UTF8/binary agnostic. It just keys off ASCII commas, newlines, etc. Beats me how one ensures you handle anything the "same" way Excel does without actually running Excel's code somehow. { Maybe today, but next year or 10 years ago? } The little state machine could be extended, but it's hard to guess what the speed impact might be until you actually write said extensions.

From a performance perspective, strictly delimiter-separated values { again, ironically redundant ;-) } can be parsed with memchr. On Linux, memchr should be SIMD vectorized at least on x86_64 glibc via ELF 'i' symbols. So, while you give up SIMD on the "messy part" with a byte-at-a-time DFA, you regain it on the other side. (I have no idea if Apple gives you SIMD-vectorized memchr.)

Writing out to a file and segmentation (for parallel handling of segments) is also a simple application of memchr rather than needing an index of where rows start. You just split by bytes and find the next newline char. (Roughly). This can get you 16..128X speed-ups (today, anyway, on just 1 host) depending upon what you do.
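A rough Python sketch of that "split by bytes and find the next newline" idea, assuming the data is already in a format where a newline byte only ever ends a record. The function name is made up, and `bytes.find` stands in for memchr:

```python
def newline_aligned_segments(data, n):
    """Split data into at most n (start, end) byte ranges, each ending just after a newline."""
    size = len(data)
    bounds = [0]
    for i in range(1, n):
        # Start from an even byte split, never moving behind the previous cut.
        guess = max(size * i // n, bounds[-1])
        nl = data.find(b"\n", guess)  # plays the role of memchr
        cut = size if nl < 0 else nl + 1
        if bounds[-1] < cut < size:
            bounds.append(cut)
    bounds.append(size)
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]

data = b"a\tb\ncc\tdd\neee\tf\n"  # made-up tab-separated sample
segments = newline_aligned_segments(data, 2)
print(segments)
```

Each range can then be handed to a separate worker with no shared index, since every segment begins at the start of a record.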

Conversion to something properly byte-delimited basically restores whatever charm you might have thought ?SV had. I can only imagine a few corner cases where running directly off a complex format like quoted CSV makes sense ("tiny" data, "cannot/will not spend 2X space+must save input", "cannot/will not spend time to recompress", "running on a network filesystem shared with those who reject simplicity".) These cases are not common (for me). When they do happen, perf is usually limited by other things like network IO, start-up overheads, etc. Usually that little extra bit to write buffers out to a pipeline will either not matter or be outright immediately repaid in parallelism, parsing simplicity, or both.

Converting from any ASCII to even faster binary formats has a similar story, but usually with even more perf improvement (depending..) and more "choices" like how to represent strings [1]. Fully pre-parsed, the performance of conversion matters much less. (Whatever the ratio of processings per initial parse is.) Between both parallelism and ASCII->binary, however fast you make your serial zsv parser/ETL stuff, actual data analysis may still run 10,000 times slower than it could be on just 1 CPU (depending upon what throttles your workloads..you may only get 10000x for CPU local L1 resident nested loop stuff). { But we veer now toward trying to cram a databases course into an HN comment. :) And I'm probably repeating myself/others. Direct email from here may work better. }

[1] https://github.com/c-blake/nio


thanks for mentioning. will try out. did you use the default build settings for zsv (i.e. just plain old "make install")? also do you have a copy or location for the dataset you used to test on? also what hardware / OS if I may ask?

A big problem is tooling - what they currently support, and the fact that excel will f*ck up anything.

I suppose the best way would be a suite of tools to edit them that are compatible with existing editors? TBH a great value-add for adoption would be versioning s.t. you can see (and revert) when other tools mess up your files.

I'm just gonna leave this here.

https://git.sr.ht/~sforman/sv/tree/master/item/sv.py

(Python code for ASCII-separated values. TL;DR: it's so stupid simple that I feel any objection to using this format has just got to be wrong. It's like "Hey, we don't have to keep stabbing ourselves in the face." ... "But how will we use our knives?" See? It's like that.)
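For readers who don't want to click through, a minimal sketch of the idea. This is not the code from the linked sv.py, just an illustration assuming ASCII 0x1F separates fields and 0x1E ends each record:

```python
US, RS = "\x1f", "\x1e"  # ASCII unit separator and record separator

def dumps(rows):
    """Join fields with US and end every record with RS. No quoting, no escaping."""
    return "".join(US.join(row) + RS for row in rows)

def loads(text):
    """Inverse of dumps: split on RS, then on US."""
    records = text.split(RS)
    if records and records[-1] == "":
        records.pop()  # drop the empty tail after the final RS
    return [rec.split(US) for rec in records]

rows = [["id", "name"], ["1", 'Bob "Billy", Jr.\nSmith']]  # made-up sample
assert loads(dumps(rows)) == rows
```

Commas, quotes and newlines pass through untouched. The sketch does nothing about fields that themselves contain 0x1F or 0x1E.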

> any objection to using this format

How do I edit the data file in a text editor?

More generally, how do I edit the entire data file, including field/record/etc separators, in an editor that can display only characters that are valid content of data fields?

(Not that CSV (barring de-facto-nonstandard extensions like rfc4180's `"b CRLF bb"` and `"b""bb"`) is any good for this either, of course.)

indeed, if the text editor doesn't care, it's no help. Same for all other characters - be it tab, cr, lf, crlf, ä, and so on.

Choose your tools wisely, I'm not aware of the situation.


Could one get away with regular AWK using these combined with a streaming to/from CSV converter?

Yes and I suspect that this is why it was never added to AWK. I, and I am sure most people, have an AWK filter to transform their csv or whatever format to a format that AWK can use with an appropriate FS and RS.

Their "but why" section should really go into more detail about why filters are not 100% if your data has the possibility of containing your preferred line/record separator.

But real world, I have never had a situation where a csv-to-awk filter (written in awk of course) did not work.


Closely related- I remain mystified as to why FIX Protocol chose to use control characters to separate fields, but used SOH rather than RS/US or the like.

Ben this is great, thank you. Would you consider adding Unicode Separated Values (USV)?

https://github.com/sixarm/usv

USV is like CSV and simpler because of no escaping and no quoting. I can donate $50 to you or your charity of choice as a token of thanks and encouragement.

> no escaping

That makes no sense. Sure, they've chosen significantly less common separator characters than something like ',', but they are still characters that may appear in the data. How do you represent a value containing ␟ Unit Separator in USV?

In-band signalling is ever going to remain in-band signalling. And in-band signalling will need escaping.

USV has no escaping on purpose because it's simpler and faster.

USV is for the 99.999% of cases that don't embed USV Control Picture characters in the content.

If you need escaping, then I can suggest you consider USVX which is USV + extensions for escaping and other conveniences, or you can use your content's own escaping such as ampersand escaping for HTML text, or you can use any other format such as JSON, XML, SQL, etc.

> USV is for the 99.999% of cases that don't embed USV Control Picture characters in the content.

I agree that wanting to put one of these characters in a csv is super rare. The vast majority of use cases would not notice that certain input characters are prohibited. But certain input characters _are_ prohibited, which means that encoders must check this case, and tools using those encoders need to handle its failure.

In a way, the fact that failure will be super rare will make it more dangerous, because people will omit these checks because "it won't happen for them" -- and then at some point it will wreak havoc.
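The check in question is small. A hedged Python sketch of an escaping-free encoder that refuses such fields, assuming ␟ (U+241F) separates fields and ␞ (U+241E) ends each record; the function name is made up and the layout is an assumption rather than a reading of the USV spec:

```python
UNIT_SEP, RECORD_SEP = "\u241f", "\u241e"  # the control pictures for US and RS

def encode_usv(rows):
    """Encode rows without escaping, rejecting any field that contains a separator."""
    out = []
    for row in rows:
        for field in row:
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError(f"field contains a separator: {field!r}")
        out.append(UNIT_SEP.join(row) + RECORD_SEP)
    return "".join(out)

print(encode_usv([["id", "name"], ["1", "Bob"]]))
```

Every caller then has to decide what to do with that ValueError, which is exactly the failure path that tends to get skipped.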

And with USVX all we've gained is fewer backslashes but more bytes per separator, so for usual data that doesn't contain a comma in every field, the USVX encoding will even increase file size, without requiring less code anywhere.

I admit there is no ideal solution, though; while out-of-band signalling (i.e. length-prefixing) avoids all of these issues (which is why binary formats are universally length-prefixing formats), escaped formats work much better for humans. And if a human wouldn't need to read it, one wouldn't use csv anyway, usually.

You just reminded me of the argument against seat belts based on the supposition that they'll encourage riskier driving thereby leading to more deaths.

Not doing the obviously right superior easy strategy because there's some infrequent weird corner case prevents any kind of progress. Perfection is the enemy of good enough, engineering is balancing constraints, ad nauseam...


This just creates another "CSV" where you don't know which format variation was actually used.

> In-band signalling is [n]ever going to remain in-band signalling

This is the root of all spreadsheet evil, right here.

TSV. Replace/strip/ban tab + newline from fields when writing. Done. If you need those characters, encode them using backslashes if you absolutely must (i.e. you're an Excel weenie doing Excel weenie things that actually aren't "spreadsheet" things but since you only know of one hammer you then use that hammer to do everything)

That has the same in-band problems as existing specs, which is that things generating the data need to have that callout.

There are many cases where the "solution" to the CSV inband signalling problem was to just reject values with commas, because they should never come in and if they ever do they should be investigated because they weren't valid data for whatever the CSV was storing. The whole problem is that programmers don't think to do that. The siren call of the string append function is just too strong, especially when programmers don't even realize they should be resisting.


Thanks. I haven't tried this, but it should actually work already in standard AWK input mode with FS and RS set to those Unicode separators. I'll test it tomorrow when back at my laptop.


Yep, this works fine (in GoAWK, gawk, and mawk, though not in original awk):

    $ cat t.usv && echo
    id␟name␟age␞1␟Bob "Billy" Smith␟42␞2␟Jane
    Brown␟37
    $ goawk -F␟ -vRS=␞ -vOFS=, '{ print $1, $2, $3 }' t.usv
    id,name,age
    1,Bob "Billy" Smith,42
    2,Jane
    Brown,37

I've also added explicit tests for ASCII and Unicode unit and record separators, just to ensure I don't regress: https://github.com/benhoyt/goawk/commit/215652de58f33630edb0...

When you have "format wars", the best idea is usually to have a converter program change to the easiest to work with format - unless this incurs a space explosion as per some image/video formats.

With CSV-like data, bulk conversion from quoted-escaped RFC4180 CSV to a simpler-to-parse format is the best plan for several reasons. First, it may "catch on", help Microsoft/R/whoever embrace the format and in doing so squash many bugs written by "data analyst/scientist coders". Second, in a shell "a|b" runs programs a & b in parallel on multi-core and allows things like csv2x|head -n10000|b or popen("csv2x foo.csv"). Third, bulk conversion to a random access file where literal delimiters cannot occur as non-delimiters allows trivial file segmentation to be nCores times faster (under frequently satisfied assumptions). There are some D tools for this bulk convert in https://github.com/eBay/tsv-utils and a much smaller stand-alone Nim tool https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim . Optional quoting was always going to be a PITA due to its non-locality. What if there is no quote anywhere? Fourth, by using a program as the unit of modularity in this case, you make things programming language agnostic. Someone could go to town and write a pure SIMD/AVX512 converter in assembly even and solve the problem "once and for all" on a given CPU. The problem is actually just simple enough that this smells possible.

I am unaware of any "document" that "standardizes" this escaped/lossless TSV format. { Maybe call it "DSV" for delimiter separated values where "delimiters actually separate"? Ironically redundant. ;-) } Someone want to write an RFC or point to one? It can be just as "general/lossless" (see https://news.ycombinator.com/item?id=31352170).

Of course, if you are going to do a lot of data processing against some data, it is even better to parse all the way down to binary so that you never have to parse again (well, unless you call CPUs loading registers "parsing") which is what database systems have been doing since the 1960s.

Someone linked a wikipedia format guide for TSV, but the world seems to have settled on using the escape codes \\, \t, \n with their obvious meanings, and so allowing arbitrary binary.

That should be parallelism friendly, even with UTF-8, where an ascii tab or newline byte always means tab and newline.
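A small Python sketch of those three escapes, to show why a literal tab or newline can then only ever be a delimiter. The function names are made up, and no standard document is being quoted here:

```python
def escape_field(s):
    """Escape backslash first, then tab and newline, so the output has no raw delimiters."""
    return s.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

def unescape_field(s):
    """Reverse escape_field by scanning left to right, one escape pair at a time."""
    mapping = {"\\": "\\", "t": "\t", "n": "\n"}
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s) and s[i + 1] in mapping:
            out.append(mapping[s[i + 1]])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

field = "a\tb\\n\nc"  # contains a tab, a literal backslash-n, and a newline
escaped = escape_field(field)
assert "\t" not in escaped and "\n" not in escaped
assert unescape_field(escaped) == field
```

Unescaping scans rather than chaining `replace` calls, because chained replaces would misread an escaped backslash followed by an `n`.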


That someone was me. I don't think of "could change at any time Wikipedia" as "as authoritative" as the "document" should be. :-) { EDIT: and I very much agree it is friendlier in almost any conceivable way except maybe Excel might not support it. Or perhaps it does? You do need to unescape binary fields at a "higher level" of usage, of course, when delimiting is no longer an issue. Also, merged my posts. }

A fast streaming converter into my suggested "DSV" can also be faster end-to-end. These kinds of things can vary a lot based upon how many columns rows have. I could not find "huge.csv". So, to be specific/possibly reproducible, using the 151492068 bytes of data from here:

    http://burntsushi.net/stuff/worldcitiespop.csv

put into /dev/shm and making a symlink to huge.csv then using the csvbench.sh in the goawk distro, I got (best of 3 elapsed times):

    Python     2.88user 0.01system 0:02.91elapsed 99%CPU (9460maxresident)k
    Goawk      1.15user 0.07system 0:01.11elapsed 110%CPU (8460maxresident)k
    Go         0.86user 0.04system 0:00.84elapsed 107%CPU (7300maxresident)k

Go vs. Goawk time ratios were similar to Ben's article but inverted. Probably a number of columns effect. frawk failed to compile for me because my Rust was not new enough, according to the error messages. On the same data:

    c2tsv+gawk 0.22user 0.04system 0:00.57elapsed 46%CPU (2680maxresident)k
    (as *c2tsv<huge.csv|gawk -F'\t' '{nfs+=NF} END{print NR, nfs}'*)
    c2tsv+mawk 0.62user 0.09system 0:00.46elapsed 155%CPU (2780maxresident)k
    (as *c2tsv<huge.csv|mawk -F'\t' '{nfs+=NF} END{print NR, nfs}'*)
    c2tsv V2   0.19user 0.07system 0:00.23elapsed 115%CPU (2680maxresident)k
    (as *c2tsv<huge.csv|wc -l*, very similar to c2tsv<huge.csv>/dev/null)
    c2tsv V3   0.19user 0.00system 0:00.20elapsed 99%CPU (2632maxresident)k
    (as *c2tsv<huge.csv>/dev/null*)

A little 100 line Nim program combined with a standard utility seems to be about 4x faster (0.84/0.23) than Go results even though said program writes out all the data again. How can this be? Well, my pipe IO is usually around 4.4 GB/s (as assessed by a dd piped to a read-only sink) while 151e6/.2=only 755 MB/s. So it need only use ~17% of available pipe BW.
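For anyone checking the arithmetic, the same numbers in Python, using the exact file size quoted earlier rather than the rounded 151e6 (so the throughput comes out a touch above 755 MB/s):

```python
file_bytes = 151_492_068  # size of worldcitiespop.csv as quoted above
elapsed_s = 0.20          # c2tsv V3 elapsed time
pipe_bw = 4.4e9           # measured pipe bandwidth in bytes/s

throughput = file_bytes / elapsed_s      # bytes per second through the converter
fraction_of_pipe = throughput / pipe_bw  # share of the pipe bandwidth actually needed
speedup_vs_go = 0.84 / 0.23              # Go elapsed time over c2tsv V2 elapsed time

print(f"{throughput / 1e6:.0f} MB/s, {fraction_of_pipe:.0%} of pipe, {speedup_vs_go:.2f}x")
```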

I always just use awk to process csv files.

    awk -F '^"|","|"$|,' '{print $2,$3}' whatever.csv

The above works perfectly well, it handles quoted fields, or even just unquoted fields.... This snippet is taken from a presentation I give on AWK and BASH scripting.

That's the thing about AWK, it already does everything. No need to extend it much at all.
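It may help to see how that separator regex carves up a few lines. Below is a hedged Python sketch using `re.split` with the same pattern; Python's regex engine is not awk's, so treat this as an approximation that should agree on simple inputs like these made-up ones:

```python
import re

FS = r'^"|","|"$|,'  # the same pattern passed to awk -F above

def awk_like_split(line):
    """Split a line the way a regex field separator would (an approximation of awk)."""
    return re.split(FS, line)

print(awk_like_split('"a","b","c"'))        # fully quoted line
print(awk_like_split('a,b,c'))              # unquoted line
print(awk_like_split('"Smith, John","x"'))  # comma inside a quoted field
```

On the fully quoted line the leading quote yields an empty first field, so the data starts at what awk would call $2, while on the unquoted line it starts at $1. In the last case the comma inside the quotes still splits the field.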


That's written in Python and uses the "agate" library which uses Python's built-in "csv" module. I did a couple of simple benchmarks against Python's csv module in the article: https://benhoyt.com/writings/goawk-csv/#performance (Go/GoAWK is a bit more than 3x as fast)

I also did a quick test using csvcut to pull out a single field, and compared it to GoAWK. Looks like GoAWK is about 4x as fast here:

    $ time csvcut -c agency_id huge.csv >/dev/null
    real 0m25.977s
    user 0m25.240s
    sys 0m0.424s
    $ time goawk -i csv -H -o csv '{ print @"agency_id" }' huge.csv >/dev/null
    real 0m6.584s
    user 0m7.434s
    sys 0m0.480s

I meant more in terms of user/developer experience.

I imagine there's some use case where AWK will be able to come up with output that csvkit can't.

But for simple cases, csvkit's invocation is easier to remember.


Ah, true -- that would be a good comparison to do, or section to add to the docs (csv.md in the repo). I'll add it to my to-do list!

In my country (Spain) we traditionally use commas as a decimal separator, but I think CSV should not support this.

The way I see it, CSV's purpose is data storage and transfer, not presentation.

Presentation is where you decide the font face, the font size, cell background color (if you are viewing the file through a spreadsheet editor) etc. Number formatting belongs here.

In data transfer and storage it's much more important that the parsing/generation functions are as simple as possible. So let's go with one numeric format only. I think we should use the dot as a decimal separator since it's the most common separator in all programming languages. Maybe extend it to include exponential notation as well, because that is what other languages like json support. But that's it.

I hold the same opinion about dates tbh.

(The same goes for dates, btw - yyyy-mm-dd or death)

For time formats, use ISO 8601, and be sure to append a Z to the end to denote UTC (zulu time), and consider rejecting timestamps without it:

https://en.m.wikipedia.org/wiki/ISO_8601
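A minimal sketch of the "reject timestamps without the Z" rule in Python. It assumes whole-second timestamps of the form YYYY-MM-DDTHH:MM:SSZ only; ISO 8601 allows many more shapes than this, and the function name is made up:

```python
from datetime import datetime, timezone

def parse_utc_timestamp(text):
    """Parse 'YYYY-MM-DDTHH:MM:SSZ'; anything else, including a missing Z, raises ValueError."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

print(parse_utc_timestamp("2022-05-12T10:30:00Z"))
```

Because the Z is a literal in the format string, a timestamp without it simply fails to parse instead of being silently treated as local time.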

The only real problem with that format is the distasteful "T" in the middle (but, hey, at least it is whitespace-free!)


The other distasteful thing is getting your hands on the actual ISO 8601 specs is pretty expensive, like 350 CHF, which is weird, because so many national standards bodies gave it official blessing, and so many things in the digital universe depend on it. Maybe that 'T' is optional after all!


What do you mean by csv shd not support this? Do you mean it was a historical mistake or new parsers shdnt be able to read it?

My opinion is that it was a historical mistake. I think if the format had been clearly specified in Microsoft Excel, humanity would have saved billions in parsing labour.

On the question whether new parsers shouldn't be able to read it... if I have to build them or maintain them, then *those* parsers shouldn't be able to read it. I don't want to have to deal with the whole messiness of the thing, and I know how to change Excel's locale to generate CSV in a reasonable format. If it's something that someone else maintains and I can mostly ignore the crazy number formatting as well as other quirks, I could live with it. But I would still prefer if they didn't implement any of it, because the extra code could make the parts that I need worse (for example, by provoking a segfault, or making a line ambiguous).

Even if it's parsers that other people use and I never use, I still would prefer if they didn't do it, because that would increase the overall possibility of me having to deal with another (sigh) semicolon-separated CSV file.


Can we please, please, stop that? If we go Metric, will you please standardize on period as decimal point? Shake on it. Let's make it happen.

Somebody told me that the official international standard prefers comma; a period is just a permitted alternative. Of course, comma as decimal separator isn't very compatible with programming languages.

However, the decimal separator that I was taught at primary school and have used in handwriting all my life is neither comma, nor period, but '·'.

What, why? I've seen comma decimal separated values but I thought it was a weird English thing. What do these places use to separate larger values (e.g. each 10^3 magnitude increase) to make manual counting easier?

828.497.171.614,2?


Yep, using a dot for grouping is a common solution. But it seems to be fading out because of the conflict. Other solutions are a space (12 345,67) or apostrophe also known as "highcomma" (12'345,67).


Exactly. The "official way" of writing big numbers around here (I know at least the Balkans, Germany and Austria) is `1.000.000,00 €` (one million). So what you wrote, but always at least 2 decimal places (or zero if it's not a financial document).
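For what it's worth, reading any of these conventions is easy once you are told which one is in play. A hedged Python sketch follows (made-up function name; it does not try to guess the convention, since something like "1.234" is ambiguous on its own):

```python
def parse_number(text, decimal_sep=".", group_sep=","):
    """Parse a number written with the given decimal and grouping separators."""
    # Drop the grouping separator first, then normalise the decimal separator to a dot.
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

print(parse_number("1.000.000,00", decimal_sep=",", group_sep="."))
print(parse_number("12 345,67", decimal_sep=",", group_sep=" "))
print(parse_number("12'345,67", decimal_sep=",", group_sep="'"))
print(parse_number("1,000,000.00"))
```

The hard part is the out-of-band knowledge of which separators a given file uses, not the parsing itself.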

Doh, Having two opposite conventions is a PITA.

But if I add my two cents - I prefer 1,000.00 because I think it is similar to commas and full-stops in writing, where a comma is mainly a reading/speaking guide to help break up a sentence (or in this case a number) and a full stop is a much harder termination of a sentence (in the case of a number the official origin where int/non-fractional part ends).


What is worse is the ignorance on the English-speaking side. We have been putting up with your stuff for a few decades now, and almost the entire population needs to know about it, while we see near-zero attempts to adapt from your side.


I'd argue the decimal separator is more important and should therefore be more visible; the comma is larger than the dot and extends below the baseline so is easier to see.


You could argue that both ways. "Here's a large amount. That's a moderate amount. And that is a small amount, with a tiny extra bit." :)

Germany has been metric forever but has the comma as a decimal separator, and the dot as a thousands group separator. It's deeply ingrained in our typography, it has typographical idioms such as ,— (comma em-dash), up to the point that even brands have adopted it in their brand name.

I'm afraid we can't really change that on a whim. Why would we, anyway?

" Why would we, anyway? "

To be compatible with the world, so Germans can use international software, and could sell software to international people.

America is not the world; at least in the western world[0] there are more countries and languages that follow the German convention. Also, there is more to the world than software (although that seems to be changing).

[0] Yes, that shows my limited perception.


Call me crazy, but I'm under the impression that Germans do use international software, and that Germany may even be home to the fourth largest software company in the world.


There are more countries using the comma than the dot. Why should the majority follow a minority in that case?


Are you sure that more people, numerically speaking, use the comma as a decimal separator? India and China do not, although admittedly India has some other decimal grouping system.


If we are aiming to create a common international system of representing the fractional part of a decimal number, it makes sense to start with the system that is already in use by the most people.

No, it does not. The countries' individual choice has more weight than their number of citizens. It's probably also more work to change laws and regulations, than letting people adapt to the changes.

This is basically the classical problem of democratic systems, when they need to balance the interests of different sized groups. Numbers alone don't make a fair solution. And unless you have the power to force them, you will not convince everyone to follow you just by arguing with numbers anyway.

You are the one who proposed that the majority should not follow the minority, but now you are also saying that the minority should not follow the majority either. If your point is that nobody should follow anybody because freedom is more important than standardization, that's fine, but then it would have been clearer to just say that in the first place.

In my opinion, when we are talking about a data interchange format like CSV, having a simple, common format would be far more practical and efficient than allowing each country to decide for itself its own standard. Having dealt with exactly this problem in a global SaaS product where a minority of clients submitted CSV files with commas for decimal separators, I can say it would have made the parsing code a lot simpler and more robust if our system (and countless others like it) did not need to build in exceptions for this minority use case.

No. I agree it should be universally used for machine-readable data formats, but for human consumption there is no need to be anglo-centric.

You guys are already slowly encroaching on milliards and billiards in other languages with your illogical short scale numbers.


Don't worry, some of us are still fighting for the gloriously base-2 US customary system; base-10 is only for people who don't float. And the 30cm foot.


Well, in a way a base-2 system would have been far ahead of its time... it's not consistently base-2 though.


Fair comeback. I think of CSV as modern, but Wikipedia tells me it's almost as old as AWK (depending on how you count). It seems to me it's used more heavily now as an exchange format, compared to say 15-20 years ago, but I could be wrong.

JSON is an exchange format... sqlite is an exchange format... even protocol buffers are an exchange format...

CSV is only an exchange format if there are no user generated strings in the data... If there are, then you'll almost certainly screw up the encoding when someone's name has a newline or comma or speech mark in it, or some obscure unicode etc. Even more so if awk is part of your toolkit.


That may have been more true years ago, but now quoting is pretty well defined with RFC 4180, and most tools seem to produce and consume RFC 4180-compatible CSV (which properly handles commas, quotes, and even newlines in fields). That said, there still are too many non-standard or quirky CSV files out there.

> and most tools seem to produce and consume RFC 4180-compatible CSV

Laughs in SSIS…

There are some significant tools (or common add-ins for them) that don't entirely respect RFC4180. Though I see few files that breach it these days, there are tools that break with conforming files (looking at you, Excel, trying to be clever about anything that isn't conclusively provable not to be a date).

Our clients use it all the time, to the point where we'd lose sales if we didn't support it, but CSV is far from a safe way to transport data IMO. Each time a new requirement to deal with CSV comes in I treat it as a custom format that may or may not be something like RFC4180.


I see a lot of CSVs generated with LF instead of CRLF as line endings. Blindly following the RFC is still probably not advised.


It is a overnice addition, but I would similar to see this taken further - structural regular expression awk. Information technology is waiting to be implemented for 35 years at present.

>A big thank-you to the library of the University of Antwerp, who sponsored this feature. They're one of two major teams or projects I know of that use GoAWK – the other one is the Benthos stream processor.

That's great to hear.

Are you planning to add support for xml, json, etc next? Something like Python's `json` module that gives you a dictionary object.


I'm not considering adding general structured formats like XML or JSON, as they don't fit AWK's "fields and records" model or its simplistic data structure (the associative array) very well. However, I have considered adding JSON Lines support, where each line is a record, and fields are indexable using the new @"named-field" syntax (maybe nested like @"foo.bar").
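As a rough illustration of that idea, and not GoAWK's actual or planned implementation, the sketch below treats each JSON Lines line as a record and resolves a dotted name such as "address.city" against it. The helper names `read_jsonl` and `named_field`, the sample data, and the choice to return an empty string for a missing field (mirroring how AWK treats unset fields) are all assumptions.

```python
import json

def named_field(record, path):
    """Look up a possibly nested field such as "foo.bar"; return "" if missing."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return ""
        value = value[key]
    return value

def read_jsonl(lines):
    """Yield one dict per non-empty line, treating each line as a record."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical sample records.
sample = [
    '{"name": "Bob", "address": {"city": "Antwerp"}}',
    '{"name": "Alice", "address": {"city": "Ghent"}}',
]
cities = [named_field(rec, "address.city") for rec in read_jsonl(sample)]
print(cities)
```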


What's the best resource for learning modern awk these days? I've used it for decades, but only via memorized snippets…


For non-awk tools, csvformat (from csvkit) will unquote and re-delimit a CSV file (-D\034 -U -B) into something that UNIX pipes can handle (cut -d\034, etc). It's worth setting up as an alias, and you can store \034 in $D or whatever for convenience.
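The same idea can be sketched with Python's `csv` module. This is not how csvformat is implemented; `redelimit` is a hypothetical helper that parses quoted CSV and joins the fields with the \034 control character so that plain field-splitting tools can take over.

```python
import csv
import io

FS = "\x1c"  # octal 034, the ASCII file-separator control character

def redelimit(csv_text, sep=FS):
    """Parse quoted CSV and return lines with fields joined by `sep`, unquoted."""
    out = []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        out.append(sep.join(row))
    return out

lines = redelimit('id,name\r\n1,"Smith, Bob"\r\n')
# A plain split on the separator now recovers the fields safely.
print([line.split(FS) for line in lines])
```

One caveat of this sketch: a field containing an embedded newline would still span two lines after re-delimiting, so line-oriented tools would need that handled separately.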


During a recent HN discussion on pipes and text versus structured objects to transfer data between programs, I started wondering if CSV wouldn't be a nice middle ground.

JSON Lines would probably beat that out: https://jsonlines.org/

I phrase that carefully. "Better"? "Worse"? Very subjective. But in the current environment, "likely to beat out CSV"? Oh, most definitely yes.

A solid upside is a single encoding story for JSON. CSV is a mess and can't be un-messed now. Size bloat from endless repetition of the object keys is a significant disadvantage, though.
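The key-repetition cost is easy to see by encoding the same records both ways. This is a small sketch with synthetic rows (the column names and values are invented); it only compares sizes and makes no claim about any particular real dataset.

```python
import csv
import io
import json

def encode_both(header, rows):
    """Return (csv_text, jsonl_text) for the same records."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)      # keys appear once, in the header
    writer.writerows(rows)
    # In JSON Lines every record repeats every key.
    jsonl = "".join(json.dumps(dict(zip(header, row))) + "\n" for row in rows)
    return buf.getvalue(), jsonl

# Synthetic data, purely for illustration.
header = ["city", "country", "population"]
rows = [[f"city{i}", "XX", i] for i in range(1000)]

csv_text, jsonl_text = encode_both(header, rows)
print(len(csv_text), len(jsonl_text))
```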

I still think objects are great, but PowerShell makes it so hard to deal with them.

I think F#-interactive (FSI) with its Hindley-Milner type-inference, would have been a much better base for a shell.


I'm not familiar with F#, but I do hate CSV tools that attempt type inference on data; in my opinion the csvkit tools should have the -I option on by default.

F# does type inference similar this:

    let x = 1;
x is now an integer. Now you can't do
    printfn "%s" x;
Only
    printfn "%i" x;
So it has strong static typing, but most of the time you don't need to be explicit about the types. It can even infer function types.

If you use FSI (F# interactive) it will always print the signatures in between, so that it's really easy to explore interfaces.

> I still think objects are great, but PowerShell makes it so hard to deal with them.

LOL, this is amazing...


The thing is, I've used .NET a lot, and C# and F# I can code in my sleep. The same object system, integrated in PowerShell, makes it really difficult to use.

That is even more astonishing.

What makes it "so hard"? $object.Property or $object.Method() is the same. new versus new-object? [type] vs type?


Those parameters are URL/Header/descriptor parameters. They don't live in the CSV itself.


I meant to have these parameters as awk options, supplied on the command line, from an envvar, or perhaps even as awk variables.


Gnu awk also has a csv extension that comes with gawkextlib. I think it may even be installed on many Linux distros by default.

I can't tell whether the UNIX people have lost their way, or just the demands of modern shell scripts cannot be met by typical shell philosophy - that is, piping together the output of small, orthogonal utilities.

The emergence and constantly increasing complexity of these small, bespoke DSLs like this or jq does not inspire confidence in me.

> demands of modern shell scripts cannot be met by typical shell philosophy

That. Pipes and unstructured binary data isn't compositional enough, making the divide between the kinds of things you can express in the language you use to write a stage in a pipeline and the kinds of things you can express by building a pipeline too big.


you made me think a possible corollary (?) question would be if the json people don't perfectly overlap with the unix people.

From the same documentation but the "more CSV" link:

>In general, using FPAT to do your own CSV parsing is like having a bed with a blanket that's not quite big enough. There's always a corner that isn't covered. We recommend, instead, that you use Manuel Collado's CSVMODE library for gawk.
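The "uncovered corner" is easy to reproduce outside gawk. The sketch below is a Python analogue of pattern-based field splitting, not gawk's FPAT itself: the regular expression (a quoted string, or a run of non-commas) and the sample lines are assumptions chosen for illustration. It copes with a comma inside quotes but silently drops an empty field, which a real CSV parser keeps.

```python
import csv
import re

# Python analogue of a "field pattern": a quoted string, or a run of non-commas.
FIELD = re.compile(r'"[^"]+"|[^,]+')

def split_by_pattern(line):
    """Split by matching what a field looks like (quotes are left in place)."""
    return FIELD.findall(line)

def split_by_parser(line):
    """Split with a real CSV parser for comparison."""
    return next(csv.reader([line]))

ok = 'Smith,Bob,"12 Main St, Apt 4",Springfield'   # hypothetical sample
gap = "a,,c"                                        # has an empty middle field

print(split_by_pattern(ok))
print(split_by_pattern(gap), split_by_parser(gap))
```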


A good and useful addition. There's a mention of CSVMODE, a gawk library. I wonder if it could be extended to support the functionality that goawk's `-i csv` has.


cheers for this. am looking at the benchmarks. how do I get huge.csv? Don't see how to fetch or generate

FYI I ran on the worldcities data at https://github.com/petewarden/dstkdata (credit to xsv for choosing that dataset) against https://github.com/BurntSushi/xsv and https://github.com/liquidaty/zsv (full disclosure: I am one of the zsv authors). Here's what I got.

fastest to slowest: zsv (0.07), xsv (0.16), goawk (0.42), python (~1.6)

Obviously, this does not tell the whole story as this test was limited to "count" and an interpreted language is expected to always be slower compared to a precompiled command, but it might be relevant to a user deciding what tool to use. Also, it might be instructive as to room for improvement in the go code (or perhaps the go code could use the c lib) -- I note that even if the goawk command is '{}' the runtime is still about the same.

full results:

---

goawk:

1000001 7000007 real 0m0.435s user 0m0.435s sys 0m0.031s

1000001 7000007 real 0m0.413s user 0m0.419s sys 0m0.024s

1000001 7000007 real 0m0.425s user 0m0.430s sys 0m0.024s

xsv:

1000000 real 0m0.157s user 0m0.141s sys 0m0.013s

1000000 real 0m0.156s user 0m0.141s sys 0m0.012s

1000000 real 0m0.158s user 0m0.142s sys 0m0.013s

zsv:

1000000 real 0m0.066s user 0m0.053s sys 0m0.010s

1000000 real 0m0.077s user 0m0.060s sys 0m0.012s

1000000 real 0m0.069s user 0m0.056s sys 0m0.010s

python:

1000001 7000007 real 0m1.589s user 0m1.553s sys 0m0.026s

1000001 7000007 real 0m1.583s user 0m1.550s sys 0m0.025s

1000001 7000007 real 0m2.122s user 0m1.675s sys 0m0.037s

---

The script for this was:

---

echo 'goawk:'

(time goawk -i csv '{ w+=NF } END { print NR, w }' < worldcitiespop_mil.csv) 2>&1 | xargs

(time goawk -i csv '{ w+=NF } END { print NR, w }' < worldcitiespop_mil.csv) 2>&1 | xargs

(time goawk -i csv '{ w+=NF } END { print NR, w }' < worldcitiespop_mil.csv) 2>&1 | xargs

echo 'xsv:'

(time xsv count < worldcitiespop_mil.csv) 2>&1 | xargs

(time xsv count < worldcitiespop_mil.csv) 2>&1 | xargs

(time xsv count < worldcitiespop_mil.csv) 2>&1 | xargs

echo 'zsv:'

(time zsv count < worldcitiespop_mil.csv) 2>&1 | xargs

(time zsv count < worldcitiespop_mil.csv) 2>&1 | xargs

(time zsv count < worldcitiespop_mil.csv) 2>&1 | xargs

echo 'python:'

(time python3 count.py < worldcitiespop_mil.csv) 2>&1 | xargs

(time python3 count.py < worldcitiespop_mil.csv) 2>&1 | xargs

(time python3 count.py < worldcitiespop_mil.csv) 2>&1 | xargs
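The count.py script itself is not shown in the thread. A plausible sketch, assuming it mirrors the goawk one-liner by printing the record count followed by the total field count, could look like the following; the `count` function and the sample input are hypothetical.

```python
import csv
import io

def count(stream):
    """Return (records, total_fields) for CSV read from a text stream."""
    records = fields = 0
    for row in csv.reader(stream):
        records += 1
        fields += len(row)
    return records, fields

# Tiny in-memory sample; the header row is counted as a record,
# just as NR counts it in the awk version.
sample = io.StringIO('a,b\r\n1,"2,3"\r\n', newline="")
print(*count(sample))
```

Passing `sys.stdin` to `count` instead of the in-memory sample would give the command-line form used in the benchmark script.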

Thank you for that! zsv looks astonishing. Yes, it's definitely going to whip GoAWK, what with SIMD parsing and careful memory handling. I've done a couple of basic things for GoAWK's CSV performance, but haven't profiled or looked at allocation bottlenecks (absolute performance definitely wasn't my first focus).

Yeah, sorry about huge.csv -- I found it online somewhere originally by searching for something like "large csv example", but can't for the life of me find it now. It's a monster 1.5GB file with 286 columns including quoted fields (whereas worldcitiespop just has a few columns and it doesn't look like it has quoted fields). I can upload to a file transfer service and send a link to you if you want ... though I should really update my benchmarks to use an easily-downloadable file instead.

If you're just using AWK features, mawk is still the fastest. GoAWK is faster than awk in most cases, on a par with gawk, but still not as fast as mawk (see https://benhoyt.com/writings/goawk-compiler-vm/#virtual-mach...) ... except for scripts that make heavy use of regexen. Unfortunately Go's regexp package is still quite slow. You could also try frawk, which is a JIT-optimized AWK written in Rust -- it's really fast, and shares some of the CSV features (but not the @"named-field" syntax).

But for most everyday usage, even for big inputs, GoAWK's performance is quite sufficient. The CSV support, and its use as a Go library -- they're more important to me than raw speed at this point.


Just use 9front. The plan9 C borrows a lot of philosophy from Golang (well, the reverse), and it's from the same creators.

I think the article already showed that with the example -F,

The trouble is commas inside quoted strings, as in the example "Smith, Bob".

I believe the field-separator option on awk will break the quoted string at the interior comma.
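The behaviour described here can be shown in a few lines. This sketch uses Python rather than awk: `str.split(",")` stands in for a bare comma field separator, and the standard `csv` module stands in for a quote-aware CSV mode. The sample line is made up.

```python
import csv

line = '"Smith, Bob",42'

# Splitting on every comma, as a plain field separator would.
naive = line.split(",")

# Quote-aware parsing keeps the quoted name together.
parsed = next(csv.reader([line]))

print(naive)   # ['"Smith', ' Bob"', '42']
print(parsed)  # ['Smith, Bob', '42']
```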


Source: https://news.ycombinator.com/item?id=31350550
