Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Messages - Josh @ Dreamland

2566
Announcements / Re: Parsers -- A novel by Josh @ Dreamland
« on: August 22, 2009, 08:40:33 PM »
I believe this is the reason compilation was just "too difficult" for the Game Maker team. They were all thinking in a box; this is the way it must be done. We must use a tree; we can't leave it to a far more experienced, far longer lived team such as the G++ one.

I'll say it again; it takes some nerve to even think we can compete with that compiler. It's totally out of the question to even try. Drop your fantasies, -- or at least, scenarios for usefulness of a token tree -- and realize that there's no need for it. Retro's right: the point is to make it so GCC, a far more capable compiler than either of us could hope to write, can read it. Only this, and nothing more.


You just told me you no longer want to use a complete tree, just a ...not tree... generated parser. So. why don't we let this war die until then.

In fact, until anyone has something to showcase, I don't really want to hear any more about it. The flaming is starting to get on my nerves. So show your knowledge and exhibit your idea by posting progress, not by flaming the hell out of anyone that disagrees with you.


As for me...
Progress: CFile parser is working quite nicely. There's a scope problem I'm just about to fix: checking if a variable already exists to report redeclaration needs to be sensitive to what scope the originally declared thing is in.

For example.

Code: [Select]
struct a;
namespace b;
{
  int a;
}

Currently bitches that I'm naming two types in the declaration (int and ::a). It shouldn't do that.

All right, solved, 12 minutes later.

2567
Announcements / Re: Parsers -- A novel by Josh @ Dreamland
« on: August 22, 2009, 12:32:29 PM »
Optimizing compiler should take care of all that, anyway.

2568
Announcements / Re: Parsers -- A novel by Josh @ Dreamland
« on: August 22, 2009, 10:09:02 AM »
That's why I've redone var at least 100 times.
I'll be modifying std::string next; have to make sure it can interface with var flawlessly. And boy do I have a plan for that.

2569
Announcements / Parsers -- A novel by Josh @ Dreamland
« on: August 21, 2009, 10:49:19 PM »
In fear of some readers being more pious than others, I'd like to clear some things up.
In other words, stop reading at the first big caption.


In summary, Rusky's trying to get me to use a token tree parser by starting work on one as a proof of concept. I assume he will continue to develop it from time to time, until it actually competes with the one I wrote as a competent proof of concept or until I'm all done. Whichever comes first.

I met it with some pretty harsh skepticism. But it's not because I feel token trees are bad.

A tree is a very useful tool. It will allow you to easily skip over large chunks of data to weed out only what you need. They are used in probably more than 90% of modern parsers. I'm not looking up an actual statistic, sorry math fans.

However, I don't feel they're appropriate for the formatter-type parser.

The entire point of the formatter parser is to add semicolons and weed out variable names by declaration. It doesn't really call for a tree, in my opinion.

Here's a sample code:
Code: [Select]
a=0b=1
That code is valid GML. So now, how do we deal with it?

Those who are weak of heart should stop reading now.

Token tree method:
The code is read into memory in a structure called a tree. A representation is depicted below:
Code: [Select]
=
├─a
└─0
=
├─b
└─1
To save size, a symbol table is used. It's basically a list of variable names/other things that appear in the code. The table makes comparison of strings from the time of its completion onward operate in constant time by comparing them as an index, not byte for byte. (Means it's efficient)
The code is then rebuilt with all proper punctuation.

My method:
The code undergoes a (somewhat) organized series of stream-like modification.

First, the code is reread into its current buffer free of comments and excess whitespace. This won't affect the code above.
Next, a second block of memory is allocated the size of the first. The first is named code, the second synt.

They look like this:
Code: [Select]
a=0b=1
Code: [Select]
n=0n=0
Next, the blocks are reallocated at twice the size for all upcoming changes. A pass is made that keeps a look out for patterns of tokens. You'll notice above in the syntax string that variable names have been replaced with n, while numbers have been replaced with 0. This is so only one check has to be made to tell what any given character in the code is a part of.

Notice the "0n". A pass is made to look out for things like that.


More complicated code
Trees are usually, or at least very often, done by using a parser generator of sort, which takes a grammar file that can be easily edited. It organizes the process by giving what I've seen to be a regex-like interface for manipulating what patterns of tokens the parser looks for.

My method's basically a hard-coded version of that. I specify what tokens require a semicolon between them in a separate function, instead of using a generator that reads that info from a text file.

In the code
Code: [Select]
if 0a=10else b=10 the token tree would do just what it did above, using if as a root of another sublevel, then putting the next = token inside it. It would then look up the rules for what's inside if to see what it puts next.

Mine does this, instead:   semis->push(')','s'); write_symbol('(');
Which does nothing but insert a left parenthesis and tell it, next time it needs to insert a semicolon, use ')' instead. This is still done in that pass.
(If I were using only find and replace like I said I could, I'd replace "if(.*);" with "if\($0\)"  )

Rusky said he didn't add for loops to his proof of concept. Wouldn't make much sense for me to do so, because all I had to do to implement for() was push two semicolons and an end parenthesis onto the stack, like I did above.


Partitioning my parser
Rusky's other complaint is that my parser is divided into sections. One's responsible for syntax check, one's responsible for C definitions, one's responsible for adding semicolons/other GM-C replacements. I don't really see what there's to criticize about that, but here's a nice reason to do so...

The formatter parser has more passes and gets more attention than both the syntax check and C parsers combined. However... They're what Rusky will call a disaster, and what I'll tell you are faster than greased lightning.

Instead of lexing, they keep a couple variables handy (integers, actually) that I like to refer to as registers. (I may in fact devote a register to each of them, as they are so few.) They let the CFile parser identify what it's going through now. Each C++ key word has an ID. (not dissimilar to the symbol table of the token tree).
In addition to this ID, a second variable is kept indicating mode. I keep track of these mentally, because that's how I goddamn roll. (Ask score_under). That's the part that really makes Rusky cringe; I rarely comment, and I don't document, because I just memorize all the values, or review my thinking to get back to them.

This ID, however, is the difference between "using" and "using namespace", just for example.

( Typedef is handled differently. Typedef's value is 128 or some crap, so I can just OR whatever I have by it to indicate that I'm typedef'ing it. (C and C++ will let you totally whore typedef) )

The most complicated thing to deal with is what may require a stack, and that's pointers to arrays to pointers....of functions.


Now, here's the kicker: I use a tree to store scope data. Le gasp, right? Each variable is stored the same way. An integer can have members if I really want to. And I can tell you just from using them there how beautiful a system they can be. But knowing that, I still say a pair of aligned strings is the way to go with a formatter.


Benefits/Downfalls
  • Tree offers O(1) string comparison. But the parser doesn't need that.
  • Tree does all punctuation in one pass. But it's liable to leave out, say, entire else statements. And not understand unary unless begged.
  • String requires multiple passes, but offers some find-replace methodology. To find a declaration, for example, I just search "d".
  • Tree is more comprehensive. I'm not sure if that's good or bad. I bring up unary again... Luis's parser had a fit on it. And Rusky's won't treat ++ as adding a unary-positive real value, or as incrementing the previous. (But he'll tell you how easy it is to add if you ask him)
  • Being component based means that my parser's syntax check only needs to do what's necessary for syntax checking... It never touches the C parser. The formatter needs to know the members of declared classes, so it calls the C parser to get that. Syntax check can just add the name as a type.

Okay, this is the single longest thing I've ever typed up. If you did any more than just skim it, I feel sorry for you.

2570
Announcements / Re: Fresh Topic
« on: August 21, 2009, 08:34:27 PM »
Well, hey, if any of you experts want to show me how it's done, feel free. When your parser can outperform mine, I'll use it. It's that simple.

Until then, however, we'll be sticking with what I'm coding.

2571
General ENIGMA / Re: Generated Parser
« on: August 21, 2009, 08:06:51 PM »
I'll do all those wonderful things when I, or someone on my team, actually intends to program support for any of them.

And Flash is the only one of those that will require total recoding of the parser; the rest is trivial modification. The syntax check will still be valid in any of those cases, and looky here, it ruins the entire rest of the project anyway because the only thing I know about Flash is that it's fuck ugly.

If it wasn't for integrating the entirety of C++, this parser would have been done within three days of my starting it. I can't think of a good reason to entirely support Flash even if I got hit in the head hard enough to want to support it.

So basically, Flash and Obj-C are off topic and out of the question.


And, as I said in advance -- as in, before you could even bring it up again -- I don't care how trivial the mistake is. Point is, it was made. You'd have to put out some really, really nasty code to ever make my parser put a semicolon where it doesn't belong, and that's the only mistake it's capable of making.

Where can I see my parser's outcome not being entirely predictable for all audiences?
Code: [Select]
a = b++ c

That's it. That's the only real ambiguity.

And idiotic mistakes in mine... heh. If it works once, it won't just stop working. Due to everything being done procedurally; each step makes assumptions about how the code is formatted based on the imminent success of the previous.

On the subject of which, I think you missed something from the first thing I said.
Code: [Select]
a=0b=1c=2became
Code: [Select]
a=0;b=1c=2
Note that it only added the semicolon once. The second time, nope. Just keeps ignoring further semicolons. Another simple fix, I say! Look how simple this is to get working if you stare at it with a magnifying glass for hours on end, testing everything over again each time to make sure two definitions in your grammar file aren't conflicting; making sure that no order of statements is going to cause total collapse.


I'll give you this, though: I did forget one thing last time. I totally forgot about allowing .7 as well as 0.7, which led to a semicolon being added. In retrospect, I should have been replacing .0 with 00 anyway, because 0.n is valid. Either way, boy did I feel stupid. That said, never did I manage to leave out half the goddamn code just because of "a single if statement that I left out".


Anyway, feel free to continue working on this. If it beats mine in a benchmark, -- that other silly little factor that just doesn't apply to something so great as a token tree -- I'll be happy to use it. And I know it doesn't matter how fast to you, but let me tell you, it certainly matters to me. Why?

I wish you luck with the C++ part of that. You'll need it. I've heard horror stories from "professionals" who were about as happy to hear of my parser's methodology as you, who tried it with the almighty token tree parser and had a record breaking parser time of ten seconds for an STL header. It'll be pretty hard for me to manage a time like that in one pass.

2572
General ENIGMA / Re: Enigma IDE (written in C++ using GTK+)
« on: August 21, 2009, 07:35:52 PM »
I was honestly bound to make that typo at some point.

Also, I don't see a problem with LGM, but I'm all for an official IDE. The hard part will be the room and path editors, I imagine.

2573
General ENIGMA / Re: Generated Parser
« on: August 20, 2009, 08:57:31 PM »
I stopped when a=0b=1c=2 came out as a=0;b=1c=2.

With my parser, if one works, they all work.

And on for() not being implemented in your parser... When I was done with if(), for() was added just by saying this:
Code: [Select]
      sy_semi=sy_semi->push(')','s');
      sy_semi=sy_semi->push(';','s');
      sy_semi=sy_semi->push(';','s');

Where if() was done with only the first line of that code.

The only thing you've demonstrated is how something can work once, and then just stop. The only way mine would ever stop working is if I managed to dis-align the code string and token string. But only a real twat would do that.

I'm writing the parser to do exactly what I want done. It's more focused than your all-purpose tree, and works independent of context. The context parsers are the syntax checker and the C Parser. Neither of those need to use a lexer; they are reading the code just as you and I do and writing things down as they go.

It would be aggravating, but doable, to write this GML-C++ formatter parser without doing any sort of lexing. But doing it this way is so much more general...

The only thing I'm not supporting from the entire C++ language with each of these parsers is literal suffixes, as in 100000000L or 0.5f. I find they're unnecessary anyway.

Even things like struct a {}; struct a b; rather than just a b; can all be so greatly generalized this way.

A tree just sounds like a great way to over-complicate things.


Edit:
I just tried a simple if statement to see if that'd work, and I noticed that it replaced = with ==. Which was nice, until I noticed it also deletes else{} statements. I'll drop dead before something that stupid happens in my parser.
And I don't care how easy it is to fix the goddamn thing. Point is, you somehow managed to delete else(), and that's not the first or only thing you can really fuck up with a token tree.

2574
General ENIGMA / Re: Enigma IDE (written in C++ using GTK+)
« on: August 20, 2009, 07:52:38 AM »
I can see it feeling that way because everything that goes on between LGM and ENIGMA is basically a hack.
We tried making ENIMGA a DLL for LGM to load, but that didn't go so well.
You and I probably need to try that again, Ism.

It also doesn't help that Java performs differently on every computer I've tried it on.

2575
Announcements / Re: Fresh Topic
« on: August 19, 2009, 10:04:23 PM »
Progress:
I'm signed up for all the courses I wanted at the local university.
Also, beginning the cross-communication cycle between the parsers. (Or pieces of parsers).

Storm's abrewin'.

Nothing earth shattering, sorry. Mostly boring things; making sure hex works. Trying nested statements and parentheses. Making sure none of the optimizations and generalizations I make kill anything.

Good thing I like parsers. ^_^

2576
Announcements / Re: Fresh Topic
« on: August 18, 2009, 07:17:29 AM »
I know a certain someone with a top hat fetish.

He also starves bugs to death, though.

Progress:
Parser does everything it used to. Only better, of course, due to extensibility via the CFile parser.
Oh, which means I lied. The parser doesn't support var yet, because var's not a primitive. I'm considering treating it like one, though, so var.whatever never accesses a member of var. Perhaps instead I'll make var's members private, and make sure the parser doesn't abduct any private members.

If you people need something to argue about, I can probably give one of my old controversies that making ENIGMA totally support C has solved... heheheh.

2577
Announcements / Re: Fresh Topic
« on: August 17, 2009, 03:33:36 PM »
Fedora isn't working at all, and the more I fight with it the angrier it gets.

I'm gonna reinstall the OS at some point, but for now I'm just working on the parser.

2578
General ENIGMA / Re: Mixed arrays
« on: August 13, 2009, 04:55:53 PM »
it's supported that since R1... Though it's been recoded multiple times since, so.

2579
Announcements / Re: Fresh Topic
« on: August 13, 2009, 04:51:32 PM »
Tried it? I had to blacklist it. Which fixed nothing. Prime.

2580
Announcements / Re: Fresh Topic
« on: August 13, 2009, 02:11:51 PM »
There's no correct driver
*starts mumbling about the apocalypse*