Channel: nthykier

Parsing bash/shell

I have been avoiding #629247 for quite a while. Not because I think we couldn’t use a better shell parser, but because I dreaded having to write the parser. Of course, #629247 blocks about 16 bugs and that number will only increase, so “someone” has to solve it eventually… Unfortunately, that “someone” is likely to be “me”.  So…

I managed to scribble down the following Perl snippet. It does a decent job of splitting lines into “words” (which may or may not contain spaces, newlines, quotes etc.). It currently tokenizes the “<<EOF” constructs (heredocs?) like any other input. It also cannot distinguish between “EOF” and “ EOF” (the former ends the heredoc, the latter doesn’t), because leading and trailing whitespace is stripped from each line before it is tokenized.
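
The “EOF” versus “ EOF” distinction could be made by comparing the raw line against the delimiter before any whitespace is trimmed. Here is a minimal sketch of that idea. The helper name `ends_heredoc` is hypothetical (it is not part of the snippet below), and quoted delimiters are ignored. For the `<<-` form, the shell strips leading tabs (not spaces) before comparing, which the third argument models.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: does $line terminate a heredoc that was started
# with "<<$delim" (or "<<-$delim" when $strip_tabs is true)?
sub ends_heredoc {
    my ($line, $delim, $strip_tabs) = @_;
    $line =~ s/\n\z//;                  # drop the newline, keep all other whitespace
    $line =~ s/^\t+// if $strip_tabs;   # "<<-" strips leading tabs only
    return $line eq $delim ? 1 : 0;
}

print ends_heredoc("EOF\n",   "EOF", 0), "\n";   # prints 1: exact match
print ends_heredoc(" EOF\n",  "EOF", 0), "\n";   # prints 0: leading space
print ends_heredoc("\tEOF\n", "EOF", 1), "\n";   # prints 1: tab stripped for "<<-"
```

The point of the sketch is only that the comparison has to happen on the untrimmed line. A tokenizer that strips whitespace first has already lost the information.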

Other defects include that it does not tokenize all operators (like “>&”). Probably all I need is a list of them and of all the “special cases” (example: “>&” can optionally take numbers on both sides, like “>&2” or “2>&1”).
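
As an illustration of one such special case, the “>&” operator with optional numbers on both sides could be recognised with a single pattern. This is only a sketch: `parse_dup` is a hypothetical helper, and it covers just the descriptor-duplication forms (`>&`, `<&`, with an optional digit string before and a digit string or `-` after), not the full list of shell redirection operators.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a word such as "2>&1" into
# (source fd, operator, target). Missing parts come back as ''.
# Returns an empty list if the word is not of this form.
sub parse_dup {
    my ($word) = @_;
    return unless $word =~ /\A(\d*)([<>]&)(\d+|-)?\z/;
    return ($1, $2, defined $3 ? $3 : '');
}

for my $word ('>&2', '2>&1', '<&-', 'echo') {
    my ($src, $op, $dst) = parse_dup($word);
    if (defined $op) {
        print "$word: src='$src' op='$op' dst='$dst'\n";
    } else {
        print "$word: not a dup-style redirection\n";
    }
}
```

A real tokenizer would also need to require that the leading number is directly adjacent to the operator in the input, since “2 >&1” and “2>&1” mean different things.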

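It does not appear to terminate in all cases. I think hitting EOF with an unclosed quote triggers this: the snippet keeps appending the next line and retrying, but at EOF there is no next line to read. If you try it out and notice something funny, please let me know.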
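
One way to guard against the suspected EOF case is to check whether the read actually returned a line before appending it. The sketch below assumes that diagnosis is right. `append_next_line` is a hypothetical helper, and the demo reads from an in-memory string rather than `<>`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Text::ParseWords qw(quotewords);

# Hypothetical helper: append the next input line to $$line_ref.
# Returns 0 at end of input so the caller can stop retrying.
sub append_next_line {
    my ($fh, $line_ref) = @_;
    my $next = <$fh>;
    return 0 unless defined $next;
    chomp $next;
    $$line_ref .= "\n" . $next;
    return 1;
}

# Demo: the only line of input has a quote that is never closed.
my $input = qq{echo "never closed\n};
open(my $fh, '<', \$input) or die "open: $!";
my $line = <$fh>;
chomp $line;

my @words;
# quotewords returns an empty list on unbalanced quotes.
until (@words = quotewords('\s+', 1, $line)) {
    unless (append_next_line($fh, \$line)) {
        print "unterminated quote at end of input\n";
        last;
    }
}
```

The same check would apply to the backslash-continuation loop, which also reads a further line without testing for EOF.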

You can also find an older version of it in bug #629247, along with the output it produced at that time (that version used " instead of - as the token marker).

#!/usr/bin/perl

use strict;
use warnings;

use Text::ParseWords qw(quotewords);
my $opregex;

{
    my $tmp = join( "|", map { quotemeta $_ } qw (&& || | ; ));
    # Match & but not >& or <&
    # - Actually, it should eventually match those, but not right now.
    $tmp .= '|(?<![\>\<])\&';
    $opregex = qr/$tmp/ox;
}
my @tokens = ();
my $lno;
while (my $line = <>) {
    chomp $line;
    next if $line =~ m/^\s*(?:\#|$)/;
    $lno = $. unless defined $lno;
    while ($line =~ s,\\$,,) {
        $line .= "\n" . <>;
        chomp $line;
    }
    $line =~ s/^\s++//;
    $line =~ s/\s++$//;
    # Ignore empty lines (again, via "$empty \ $empty"-constructs)
    next if $line =~ m/^\s*(?:\#|$)/;

    my @it = quotewords ($opregex, 'delimiters', $line);
    if (!@it) {
        # This happens if the line has unbalanced quotes, so pop another
        # line and redo the loop.
        $line .= "\n" . <>;
        redo;
    }

    foreach my $orig (@it) {
        my @l;
        $orig =~ s,",\\\\",g;
        @l = quotewords (qr/\s++/, 1, $orig);
        pop @l unless defined $l[-1] && $l[-1] ne '';
        shift @l if $l[0] eq '';
        push @tokens, map { s,\\\\",",g; $_ } @l;
    }
    print "Line $lno: -" . join ("- -", map { s/\n/\\n/g; $_ } @tokens ) . "-\n";
    @tokens = ();
    $lno = undef;
}

Here is a little example script and the “tokenization” of that script (no, the example script is not supposed to be useful).

$ cat test
#!/bin/sh

for p in *; do
    if [ -d "$p" ];then continue;elif
    [ -f "$p" ]
    then echo "$p is a file";fi
done
$ ./test.pl test
Line 3: -for- -p- -in- -*- -;- -do-
Line 4: -if- -[- --d- -"$p"- -]- -;- -then- -continue- -;- -elif-
Line 5: -[- --f- -"$p"- -]-
Line 6: -then- -echo- -"$p is a file"- -;- -fi-
Line 7: -done-

