Pages: 1 2 3 4 »
  Print  
Author Topic: Parsers -- A novel by Josh @ Dreamland  (Read 11520 times)
Offline (Male) Josh @ Dreamland
Posted on: August 21, 2009, 10:49:19 pm

Prince of all Goldfish
Developer
Location: Pittsburgh, PA, USA
Joined: Feb 2008
Posts: 2950

View Profile Email
In fear of some readers being more pious than others, I'd like to clear some things up.
In other words, stop reading at the first big caption.


In summary, Rusky's trying to get me to use a token tree parser by starting work on one as a proof of concept. I assume he will continue to develop it from time to time, until it actually competes with the one I wrote as a competent proof of concept or until I'm all done. Whichever comes first.

I met it with some pretty harsh skepticism. But it's not because I feel token trees are bad.

A tree is a very useful tool. It will allow you to easily skip over large chunks of data to weed out only what you need. They are used in probably more than 90% of modern parsers. I'm not looking up an actual statistic, sorry math fans.

However, I don't feel they're appropriate for the formatter-type parser.

The entire point of the formatter parser is to add semicolons and weed out variable names by declaration. It doesn't really call for a tree, in my opinion.

Here's a sample code:
Code: [Select]
a=0b=1
That code is valid GML. So now, how do we deal with it?

Those who are weak of heart should stop reading now.

Token tree method:
The code is read into memory in a structure called a tree. A representation is depicted below:
Code: [Select]
=
├─a
└─0
=
├─b
└─1
To save size, a symbol table is used. It's basically a list of variable names/other things that appear in the code. The table makes comparison of strings from the time of its completion onward operate in constant time by comparing them as an index, not byte for byte. (Means it's efficient)
The code is then rebuilt with all proper punctuation.

My method:
The code undergoes a (somewhat) organized series of stream-like modification.

First, the code is reread into its current buffer free of comments and excess whitespace. This won't affect the code above.
Next, a second block of memory is allocated the size of the first. The first is named code, the second synt.

They look like this:
Code: [Select]
a=0b=1
Code: [Select]
n=0n=0
Next, the blocks are reallocated at twice the size for all upcoming changes. A pass is made that keeps a look out for patterns of tokens. You'll notice above in the syntax string that variable names have been replaced with n, while numbers have been replaced with 0. This is so only one check has to be made to tell what any given character in the code is a part of.

Notice the "0n". A pass is made to look out for things like that.


More complicated code
Trees are usually, or at least very often, done by using a parser generator of sort, which takes a grammar file that can be easily edited. It organizes the process by giving what I've seen to be a regex-like interface for manipulating what patterns of tokens the parser looks for.

My method's basically a hard-coded version of that. I specify what tokens require a semicolon between them in a separate function, instead of using a generator that reads that info from a text file.

In the code
Code: [Select]
if 0a=10else b=10 the token tree would do just what it did above, using if as a root of another sublevel, then putting the next = token inside it. It would then look up the rules for what's inside if to see what it puts next.

Mine does this, instead:   semis->push(')','s'); write_symbol('(');
Which does nothing but insert a left parenthesis and tell it, next time it needs to insert a semicolon, use ')' instead. This is still done in that pass.
(If I were using only find and replace like I said I could, I'd replace "if(.*);" with "if\($0\)"  )

Rusky said he didn't add for loops to his proof of concept. Wouldn't make much sense for me to do so, because all I had to do to implement for() was push two semicolons and an end parenthesis onto the stack, like I did above.


Partitioning my parser
Rusky's other complaint is that my parser is divided into sections. One's responsible for syntax check, one's responsible for C definitions, one's responsible for adding semicolons/other GM-C replacements. I don't really see what there's to criticize about that, but here's a nice reason to do so...

The formatter parser has more passes and gets more attention than both the syntax check and C parsers combined. However... They're what Rusky will call a disaster, and what I'll tell you are faster than greased lightning.

Instead of lexing, they keep a couple variables handy (integers, actually) that I like to refer to as registers. (I may in fact devote a register to each of them, as they are so few.) They let the CFile parser identify what it's going through now. Each C++ key word has an ID. (not dissimilar to the symbol table of the token tree).
In addition to this ID, a second variable is kept indicating mode. I keep track of these mentally, because that's how I goddamn roll. (Ask score_under). That's the part that really makes Rusky cringe; I rarely comment, and I don't document, because I just memorize all the values, or review my thinking to get back to them.

This ID, however, is the difference between "using" and "using namespace", just for example.

( Typedef is handled differently. Typedef's value is 128 or some crap, so I can just OR whatever I have by it to indicate that I'm typedef'ing it. (C and C++ will let you totally whore typedef) )

The most complicated thing to deal with is what may require a stack, and that's pointers to arrays to pointers....of functions.


Now, here's the kicker: I use a tree to store scope data. Le gasp, right? Each variable is stored the same way. An integer can have members if I really want to. And I can tell you just from using them there how beautiful a system they can be. But knowing that, I still say a pair of aligned strings is the way to go with a formatter.


Benefits/Downfalls
  • Tree offers O(1) string comparison. But the parser doesn't need that.
  • Tree does all punctuation in one pass. But it's liable to leave out, say, entire else statements. And not understand unary unless begged.
  • String requires multiple passes, but offers some find-replace methodology. To find a declaration, for example, I just search "d".
  • Tree is more comprehensive. I'm not sure if that's good or bad. I bring up unary again... Luis's parser had a fit on it. And Rusky's won't treat ++ as adding a unary-positive real value, or as incrementing the previous. (But he'll tell you how easy it is to add if you ask him)
  • Being component based means that my parser's syntax check only needs to do what's necessary for syntax checking... It never touches the C parser. The formatter needs to know the members of declared classes, so it calls the C parser to get that. Syntax check can just add the name as a type.

Okay, this is the single longest thing I've ever typed up. If you did any more than just skim it, I feel sorry for you.
Logged
"That is the single most cryptic piece of code I have ever seen." -Master PobbleWobble
"I disapprove of what you say, but I will defend to the death your right to say it." -Evelyn Beatrice Hall, Friends of Voltaire
Offline (Male) RetroX
Reply #1 Posted on: August 22, 2009, 09:52:35 am

Master of all things Linux
Contributor
Location: US
Joined: Apr 2008
Posts: 1055
MSN Messenger - classixretrox@gmail.com
View Profile Email
I comment my code anyways, but I normally don't comment it much.  Normally, I just put comments at the tops of different sections so I know what they do without looking in the code (even though I also make variable names and function names long and precise, which makes comments kind of pointless, but whatever).

I normally don't comment code like:
Code: [Select]
// if yspeed is greater than 0
if (yspeed>0)

Because that's obviously pointless.

Anyways, to be entirely honest, I never really understood how exactly your parser worked until now, and to be honest, it sounds like an efficient method that I might have actually thought of myself if I spent enough time on it.  Unlike other people, I know when my code could be done better, and I normally don't settle for anything less.  If I know I can do better, then I won't settle for a "good" method.

Probably why I don't get many projects done, lol.
Logged
My Box: Phenom II 3.4GHz X4 | ASUS ATI RadeonHD 5770, 1GB GDDR5 RAM | 1x4GB DDR3 SRAM | Arch Linux, x86_64 (Cube) / Windows 7 x64 (Blob)
Quote from: Fede-lasse
Why do all the pro-Microsoft people have troll avatars? :(
Offline (Male) Josh @ Dreamland
Reply #2 Posted on: August 22, 2009, 10:09:02 am

Prince of all Goldfish
Developer
Location: Pittsburgh, PA, USA
Joined: Feb 2008
Posts: 2950

View Profile Email
That's why I've redone var at least 100 times.
I'll be modifying std::string next; have to make sure it can interface with var flawlessly. And boy do I have a plan for that.
Logged
"That is the single most cryptic piece of code I have ever seen." -Master PobbleWobble
"I disapprove of what you say, but I will defend to the death your right to say it." -Evelyn Beatrice Hall, Friends of Voltaire
Offline (Male) RetroX
Reply #3 Posted on: August 22, 2009, 10:12:29 am

Master of all things Linux
Contributor
Location: US
Joined: Apr 2008
Posts: 1055
MSN Messenger - classixretrox@gmail.com
View Profile Email
I even modify my code to be consistent if I change one thing.  I used to use i+=1, which is okay, but once I changed it to i++ which was faster, I had to change all my previous scripts.  Same thing goes for ==true and ==false.
Logged
My Box: Phenom II 3.4GHz X4 | ASUS ATI RadeonHD 5770, 1GB GDDR5 RAM | 1x4GB DDR3 SRAM | Arch Linux, x86_64 (Cube) / Windows 7 x64 (Blob)
Quote from: Fede-lasse
Why do all the pro-Microsoft people have troll avatars? :(
Offline (Male) Josh @ Dreamland
Reply #4 Posted on: August 22, 2009, 12:32:29 pm

Prince of all Goldfish
Developer
Location: Pittsburgh, PA, USA
Joined: Feb 2008
Posts: 2950

View Profile Email
Optimizing compiler should take care of all that, anyway.
Logged
"That is the single most cryptic piece of code I have ever seen." -Master PobbleWobble
"I disapprove of what you say, but I will defend to the death your right to say it." -Evelyn Beatrice Hall, Friends of Voltaire
Offline (Male) RetroX
Reply #5 Posted on: August 22, 2009, 12:37:46 pm

Master of all things Linux
Contributor
Location: US
Joined: Apr 2008
Posts: 1055
MSN Messenger - classixretrox@gmail.com
View Profile Email
Yeah, but I still do it, anyways.  And interpreted languages like PHP aren't optimized.  In PHP, I even unset variables after I used them to make code run faster.
Logged
My Box: Phenom II 3.4GHz X4 | ASUS ATI RadeonHD 5770, 1GB GDDR5 RAM | 1x4GB DDR3 SRAM | Arch Linux, x86_64 (Cube) / Windows 7 x64 (Blob)
Quote from: Fede-lasse
Why do all the pro-Microsoft people have troll avatars? :(
Offline (Unknown gender) Micah
Reply #6 Posted on: August 22, 2009, 12:49:07 pm

Resident Troll
Joined: May 2008
Posts: 128

View Profile
I even modify my code to be consistent if I change one thing.  I used to use i+=1, which is okay, but once I changed it to i++ which was faster, I had to change all my previous scripts.
Compiler optimizations aside, ++i is faster than i++, because i++ has to create a copy of i to evaluate to and increment the real one, while ++i just increments it.

Although I'll bet it doesn't matter with a decent compiler.
Logged
Offline (Male) Rusky
Reply #7 Posted on: August 22, 2009, 01:14:02 pm

Resident Troll
Joined: Feb 2008
Posts: 954
MSN Messenger - rpjohnst@gmail.com
View Profile WWW Email
The tree didn't leave out else. The printing code at the end left out else, and it took a single line to put it in: if (operands.size() == 3) { printf("else{"); operands[3]->print(); putchar('}'); }
Your complaints about unary are completely pointless. "Unless begged"? More like "unless specified". You use the same character, '-', for both binary and unary minus. All you have to do is add another operator.
Your special find-replace crap is a lot slower than an LR parser. Having to search through the entire string again to find declarations is dumb when you could just build a table of declarations as you go.
a = b ++ c is a problem in any parser, not just mine.
My parser is divided into phases. Each phase uses the information from the previous one, so it doesn't have to do any extra work, and it doesn't have to be divided into different programs either.

You say yours is faster, but is it more memory efficient? You have to store at least twice the size of the program. In programs that have a lot of uses of the same name, that wastes a lot of space. In programs that use identifiers like begin/end, if/else, anything like that, you waste space. Also, because you're using a tree for scope and a few stacks for parentheses, etc. you waste even more space- you're basically duplicating information.

The way you described a generated parser is wrong. You only use regular expressions for the lexer, and you don't even have to do that- you could write the lexer yourself if you wanted. All the lexer needs to do is separate the program into a list of tokens and send them to the parser. Then the parser uses a grammar, which looks something like this:
Code: [Select]
expr:
    NUMBER
  | NAME
  | '-' expr %prec UMINUS
  | expr '+' expr
  | expr '-' expr
  | expr '*' expr
  | expr '/' expr
  | '(' expr ')'
  ;
That tells the parser that an expression can be a number, a name, a negated expression (recursive definitions are awesome), an expression plus an expression, etc. The operators +, -, *, / and UMINUS have their precedences defined earlier. The %prec UMINUS tells the generator that the - in that production is the unary one (so it has higher precedence).
Logged
Offline (Unknown gender) luiscubal
Reply #8 Posted on: August 22, 2009, 04:24:32 pm
Member
Joined: Jun 2009
Posts: 452

View Profile Email
On Antlr, your snippet will cause a problem because it's left-recursive. I don't know about other generators.
Logged
Offline (Male) Rusky
Reply #9 Posted on: August 22, 2009, 08:13:55 pm

Resident Troll
Joined: Feb 2008
Posts: 954
MSN Messenger - rpjohnst@gmail.com
View Profile WWW Email
No, in antlr you're supposed to be left-recursive. Unless antlr sucks worse than I thought and it doesn't do LR parsers.
Logged
Offline (Male) RetroX
Reply #10 Posted on: August 22, 2009, 08:15:18 pm

Master of all things Linux
Contributor
Location: US
Joined: Apr 2008
Posts: 1055
MSN Messenger - classixretrox@gmail.com
View Profile Email
Rusky, I don't see why you're trying so hard.  Josh's method is more efficient, and while it may seem more complex and less readable, there is no reason why it should not be obvious that it's a better method.

The lightbulb was originally an idea of something that burns so much electricity that it creates heat and then light (although hopefully more light than heat, unless it's a heat lamp).  Then, someone thought of florescent lightbulbs.  How do they work?  They glow.  That's it.  No burning up electricity in an extremely inefficient matter.  They just work.

A tree might work.  It was what was thought of first.  Is it the best method?
Logged
My Box: Phenom II 3.4GHz X4 | ASUS ATI RadeonHD 5770, 1GB GDDR5 RAM | 1x4GB DDR3 SRAM | Arch Linux, x86_64 (Cube) / Windows 7 x64 (Blob)
Quote from: Fede-lasse
Why do all the pro-Microsoft people have troll avatars? :(
Offline (Male) Rusky
Reply #11 Posted on: August 22, 2009, 08:17:41 pm

Resident Troll
Joined: Feb 2008
Posts: 954
MSN Messenger - rpjohnst@gmail.com
View Profile WWW Email
It's not merely less readable, it's unreadable. Also, this one is extremely readable, and you have no proof that it's more efficient.

Your little comment about the lightbulb is moronic. Have you ever written anything remotely parser-related? The only reason Josh's method is even maybe acceptable is because he's not really parsing it. He's just prettifying it so GCC can parse it. If he were going to compile it himself, his method would not work at all.
Logged
Offline (Male) RetroX
Reply #12 Posted on: August 22, 2009, 08:28:59 pm

Master of all things Linux
Contributor
Location: US
Joined: Apr 2008
Posts: 1055
MSN Messenger - classixretrox@gmail.com
View Profile Email
The only reason Josh's method is even maybe acceptable is because he's not really parsing it. He's just prettifying it so GCC can parse it. If he were going to compile it himself, his method would not work at all.
Which is what the parser is supposed to do!  It's not a syntax checker.  It parses Game Maker syntax into C++ syntax.  If it's not good enough for Game Maker to read, it's not going to be parsed correctly, and it won't compile.  However, if Game Maker can read it, it will be made readable by the compiler.
Logged
My Box: Phenom II 3.4GHz X4 | ASUS ATI RadeonHD 5770, 1GB GDDR5 RAM | 1x4GB DDR3 SRAM | Arch Linux, x86_64 (Cube) / Windows 7 x64 (Blob)
Quote from: Fede-lasse
Why do all the pro-Microsoft people have troll avatars? :(
Offline (Male) Josh @ Dreamland
Reply #13 Posted on: August 22, 2009, 08:40:33 pm

Prince of all Goldfish
Developer
Location: Pittsburgh, PA, USA
Joined: Feb 2008
Posts: 2950

View Profile Email
I believe this is the reason compilation was just "too difficult" for the Game Maker team. They were all thinking in a box; this is the way it must be done. We must use a tree; we can't leave it to a far more experienced, far longer lived team such as the G++ one.

I'll say it again; it takes some nerve to even think we can compete with that compiler. It's totally out of the question to even try. Drop your fantasies, -- or at least, scenarios for usefulness of a token tree -- and realize that there's no need for it. Retro's right: the point is to make it so GCC, a far more capable compiler than either of us could hope to write, can read it. Only this, and nothing more.


You just told me you no longer want to use a complete tree, just a ...not tree... generated parser. So. why don't we let this war die until then.

In fact, until anyone has something to showcase, I don't really want to hear any more about it. The flaming is starting to get on my nerves. So show your knowledge and exhibit your idea by posting progress, not by flaming the hell out of anyone that disagrees with you.


As for me...
Progress: CFile parser is working quite nicely. There's a scope problem I'm just about to fix: checking if a variable already exists to report redeclaration needs to be sensitive to what scope the originally declared thing is in.

For example.

Code: [Select]
struct a;
namespace b;
{
  int a;
}

Currently bitches that I'm naming two types in the declaration (int and ::a). It shouldn't do that.

All right, solved, 12 minutes later.
« Last Edit: August 22, 2009, 08:56:05 pm by Josh @ Dreamland » Logged
"That is the single most cryptic piece of code I have ever seen." -Master PobbleWobble
"I disapprove of what you say, but I will defend to the death your right to say it." -Evelyn Beatrice Hall, Friends of Voltaire
Offline (Unknown gender) luiscubal
Reply #14 Posted on: August 23, 2009, 08:30:33 am
Member
Joined: Jun 2009
Posts: 452

View Profile Email
Actually, left recursivity is a fatal error in Antlr. So it does suck quite a lot ;)

Also, I think you're overkilling a few things. GCC does a lot, so why waste effort? GCC has a preprocessor. So why bother working on one? You could have the EDL go through the preprocessor, then your parser and THEN the C compiler itself. It could work...
Logged
Pages: 1 2 3 4 »
  Print