I have been avoiding #629247 for quite a while. Not because I think we couldn’t use a better shell parser, but because I dreaded having to write the parser. Of course, #629247 blocks about 16 bugs and that number will only increase, so “someone” has to solve it eventually… Unfortunately, that “someone” is likely to be “me”. So…
I managed to scribble down the following Perl snippet. It does a decent job of splitting lines into “words” (which may or may not contain spaces, newlines, quotes etc.). It currently tokenizes the “<<EOF” constructs (heredocs?). It also does not let you distinguish between “EOF” and “ EOF” (the former ends the heredoc, the latter doesn’t).
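A minimal sketch of how the terminator check could look, assuming plain “<<EOF” heredocs only (no “<<-EOF”, no quoted delimiters). The helper name read_heredoc_body is hypothetical and not part of the snippet in this post; it compares each line against the delimiter exactly, so “ EOF” stays in the body.

```perl
use strict;
use warnings;

# Hypothetical helper: given a heredoc delimiter and a reference to the
# remaining input lines, remove and return the heredoc body.  Only a line
# that is exactly the delimiter ends the body; " EOF" (with a leading
# space) is treated as ordinary body text.
sub read_heredoc_body {
    my ($delim, $lines) = @_;
    my @body;
    while (defined(my $line = shift @$lines)) {
        chomp $line;
        return @body if $line eq $delim;
        push @body, $line;
    }
    die "unterminated heredoc (expected '$delim')\n";
}

my @lines = ("hello\n", " EOF\n", "EOF\n", "echo done\n");
my @body  = read_heredoc_body('EOF', \@lines);
print scalar(@body), " body lines, ", scalar(@lines), " left\n";
```

With the sample input this reports two body lines (“hello” and “ EOF”) and leaves the “echo done” line for the regular tokenizer.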
Other defects include that it does not tokenize all operators (like “>&”). Probably all I need is a list of them and all the “special cases” (example: “>&” can optionally take numbers on both sides, like “>&2” or “2>&1”).
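As an illustration of one such special case, here is a rough sketch that recognises “>&” and “<&” with optional numbers on either side. The helper name parse_dup is hypothetical, and the pattern only covers this one operator family; it is not a complete operator list.

```perl
use strict;
use warnings;

# Hypothetical helper: recognise words of the form "[n]>&[m]" or "[n]<&[m]".
# Returns (left fd, operator, right fd); a missing fd comes back as undef.
# Returns an empty list if the word is not such an operator.
sub parse_dup {
    my ($word) = @_;
    return unless $word =~ m/^(\d*)([<>]&)(\d+)?$/;
    my ($left, $op, $right) = ($1, $2, $3);
    $left = undef if $left eq '';
    return ($left, $op, $right);
}

for my $word ('>&2', '2>&1', '>&', 'foo') {
    my @parts = parse_dup($word);
    if (@parts) {
        my $desc = join(' ', map { defined $_ ? $_ : '(none)' } @parts);
        print "${word} => $desc\n";
    } else {
        print "${word} is not a dup operator\n";
    }
}
```

A real tokenizer would have to apply something like this before splitting on “&”, which is why the snippet in this post deliberately refuses to match “&” after “>” or “<” for now.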
It does not always appear to terminate (I think EOF + unclosed quote triggers this). If you try it out and notice something funny, please let me know.
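If the guess about EOF plus an unclosed quote is right, the culprit would be appending the result of a read without checking for end of input before retrying. A minimal sketch of a guard, using a hypothetical helper continuation_line and an in-memory file handle for the demo:

```perl
use strict;
use warnings;

# Hypothetical helper: fetch a continuation line, or die at end of input
# instead of appending undef and retrying forever.
sub continuation_line {
    my ($fh, $lno) = @_;
    my $next = <$fh>;
    die "line $lno: unexpected end of input (unclosed quote?)\n"
        unless defined $next;
    chomp $next;
    return $next;
}

# Demo: the input consists of a single line with an unclosed quote.
my $input = "echo \"unterminated\n";
open(my $fh, '<', \$input) or die "open: $!";
my $line = <$fh>;
chomp $line;

my $ok = eval { $line .= "\n" . continuation_line($fh, 1); 1 };
if ($ok) {
    print "got more input\n";
} else {
    print "stopped: $@";
}
```

In the demo the second read hits end of input, so the helper dies and the caller reports the problem instead of looping.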
You can also find an older version of it in bug #629247, along with the output it produced at that time (that version used " instead of - as the token marker).
#!/usr/bin/perl

use strict;
use warnings;

use Text::ParseWords qw(quotewords);

my $opregex;
{
    my $tmp = join( "|", map { quotemeta $_ } qw (&& || | ; ));
    # Match & but not >& or <&
    #  - Actually, it should eventually match those, but not right now.
    $tmp .= '|(?<![\>\<])\&';
    $opregex = qr/$tmp/ox;
}

my @tokens = ();
my $lno;
while (my $line = <>) {
    chomp $line;
    next if $line =~ m/^\s*(?:\#|$)/;
    $lno = $. unless defined $lno;
    while ($line =~ s,\\$,,) {
        $line .= "\n" . <>;
        chomp $line;
    }
    $line =~ s/^\s++//;
    $line =~ s/\s++$//;
    # Ignore empty lines (again, via "$empty \ $empty"-constructs)
    next if $line =~ m/^\s*(?:\#|$)/;
    my @it = quotewords ($opregex, 'delimiters', $line);
    if (!@it) {
        # This happens if the line has unbalanced quotes, so pop another
        # line and redo the loop.
        $line .= "\n" . <>;
        redo;
    }
    foreach my $orig (@it) {
        my @l;
        $orig =~ s,",\\\\",g;
        @l = quotewords (qr/\s++/, 1, $orig);
        pop @l unless defined $l[-1] && $l[-1] ne '';
        shift @l if $l[0] eq '';
        push @tokens, map { s,\\\\",",g; $_ } @l;
    }
    print "Line $lno: -" . join ("- -", map { s/\n/\\n/g; $_ } @tokens ) . "-\n";
    @tokens = ();
    $lno = undef;
}
Here is a little example script and the “tokenization” of that script (no, the example script is not supposed to be useful).
$ cat test
#!/bin/sh

for p in *; do
if [ -d "$p" ];then continue;elif
[ -f "$p" ]
then echo "$p is a file";fi
done
$ ./test.pl test
Line 3: -for- -p- -in- -*- -;- -do-
Line 4: -if- -[- --d- -"$p"- -]- -;- -then- -continue- -;- -elif-
Line 5: -[- --f- -"$p"- -]-
Line 6: -then- -echo- -"$p is a file"- -;- -fi-
Line 7: -done-
