ENIGMA Development Environment
Website is in read-only mode due to a recent attack.

Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Messages - Josh @ Dreamland

2566
Announcements / Re: Parsers -- A novel by Josh @ Dreamland
« on: August 22, 2009, 10:09:02 AM »
That's why I've redone var at least 100 times.
I'll be modifying std::string next; have to make sure it can interface with var flawlessly. And boy do I have a plan for that.

2567
Announcements / Parsers -- A novel by Josh @ Dreamland
« on: August 21, 2009, 10:49:19 PM »
In fear of some readers being more pious than others, I'd like to clear some things up.
In other words, stop reading at the first big caption.


In summary, Rusky's trying to get me to use a token tree parser by starting work on one as a proof of concept. I assume he will continue to develop it from time to time, until it actually competes with the one I wrote as a competent proof of concept or until I'm all done. Whichever comes first.

I met it with some pretty harsh skepticism. But it's not because I feel token trees are bad.

A tree is a very useful tool. It will allow you to easily skip over large chunks of data to weed out only what you need. They are used in probably more than 90% of modern parsers. I'm not looking up an actual statistic, sorry math fans.

However, I don't feel they're appropriate for the formatter-type parser.

The entire point of the formatter parser is to add semicolons and weed out variable names by declaration. It doesn't really call for a tree, in my opinion.

Here's a sample code:
Code: [Select]
a=0b=1
That code is valid GML. So now, how do we deal with it?

Those who are weak of heart should stop reading now.

Token tree method:
The code is read into memory in a structure called a tree. A representation is depicted below:
Code: [Select]
=
├─a
└─0
=
├─b
└─1
To save size, a symbol table is used. It's basically a list of variable names/other things that appear in the code. The table makes comparison of strings from the time of its completion onward operate in constant time by comparing them as an index, not byte for byte. (Means it's efficient)
The code is then rebuilt with all proper punctuation.

My method:
The code undergoes a (somewhat) organized series of stream-like modification.

First, the code is reread into its current buffer free of comments and excess whitespace. This won't affect the code above.
Next, a second block of memory is allocated the size of the first. The first is named code, the second synt.

They look like this:
Code: [Select]
a=0b=1
Code: [Select]
n=0n=0
Next, the blocks are reallocated at twice the size for all upcoming changes. A pass is made that keeps a look out for patterns of tokens. You'll notice above in the syntax string that variable names have been replaced with n, while numbers have been replaced with 0. This is so only one check has to be made to tell what any given character in the code is a part of.

Notice the "0n". A pass is made to look out for things like that.


More complicated code
Trees are usually, or at least very often, done by using a parser generator of sort, which takes a grammar file that can be easily edited. It organizes the process by giving what I've seen to be a regex-like interface for manipulating what patterns of tokens the parser looks for.

My method's basically a hard-coded version of that. I specify what tokens require a semicolon between them in a separate function, instead of using a generator that reads that info from a text file.

In the code
Code: [Select]
if 0a=10else b=10 the token tree would do just what it did above, using if as a root of another sublevel, then putting the next = token inside it. It would then look up the rules for what's inside if to see what it puts next.

Mine does this, instead:   semis->push(')','s'); write_symbol('(');
Which does nothing but insert a left parenthesis and tell it, next time it needs to insert a semicolon, use ')' instead. This is still done in that pass.
(If I were using only find and replace like I said I could, I'd replace "if(.*);" with "if\($0\)"  )

Rusky said he didn't add for loops to his proof of concept. Wouldn't make much sense for me to do so, because all I had to do to implement for() was push two semicolons and an end parenthesis onto the stack, like I did above.


Partitioning my parser
Rusky's other complaint is that my parser is divided into sections. One's responsible for syntax check, one's responsible for C definitions, one's responsible for adding semicolons/other GM-C replacements. I don't really see what there's to criticize about that, but here's a nice reason to do so...

The formatter parser has more passes and gets more attention than both the syntax check and C parsers combined. However... They're what Rusky will call a disaster, and what I'll tell you are faster than greased lightning.

Instead of lexing, they keep a couple variables handy (integers, actually) that I like to refer to as registers. (I may in fact devote a register to each of them, as they are so few.) They let the CFile parser identify what it's going through now. Each C++ key word has an ID. (not dissimilar to the symbol table of the token tree).
In addition to this ID, a second variable is kept indicating mode. I keep track of these mentally, because that's how I goddamn roll. (Ask score_under). That's the part that really makes Rusky cringe; I rarely comment, and I don't document, because I just memorize all the values, or review my thinking to get back to them.

This ID, however, is the difference between "using" and "using namespace", just for example.

( Typedef is handled differently. Typedef's value is 128 or some crap, so I can just OR whatever I have by it to indicate that I'm typedef'ing it. (C and C++ will let you totally whore typedef) )

The most complicated thing to deal with is what may require a stack, and that's pointers to arrays to pointers....of functions.


Now, here's the kicker: I use a tree to store scope data. Le gasp, right? Each variable is stored the same way. An integer can have members if I really want to. And I can tell you just from using them there how beautiful a system they can be. But knowing that, I still say a pair of aligned strings is the way to go with a formatter.


Benefits/Downfalls
  • Tree offers O(1) string comparison. But the parser doesn't need that.
  • Tree does all punctuation in one pass. But it's liable to leave out, say, entire else statements. And not understand unary unless begged.
  • String requires multiple passes, but offers some find-replace methodology. To find a declaration, for example, I just search "d".
  • Tree is more comprehensive. I'm not sure if that's good or bad. I bring up unary again... Luis's parser had a fit on it. And Rusky's won't treat ++ as adding a unary-positive real value, or as incrementing the previous. (But he'll tell you how easy it is to add if you ask him)
  • Being component based means that my parser's syntax check only needs to do what's necessary for syntax checking... It never touches the C parser. The formatter needs to know the members of declared classes, so it calls the C parser to get that. Syntax check can just add the name as a type.

Okay, this is the single longest thing I've ever typed up. If you did any more than just skim it, I feel sorry for you.

2568
Announcements / Re: Fresh Topic
« on: August 21, 2009, 08:34:27 PM »
Well, hey, if any of you experts want to show me how it's done, feel free. When your parser can outperform mine, I'll use it. It's that simple.

Until then, however, we'll be sticking with what I'm coding.

2569
General ENIGMA / Re: Generated Parser
« on: August 21, 2009, 08:06:51 PM »
I'll do all those wonderful things when I, or someone on my team, actually intends to program support for any of them.

And Flash is the only one of those that will require total recoding of the parser; the rest is trivial modification. The syntax check will still be valid in any of those cases, and looky here, it ruins the entire rest of the project anyway because the only thing I know about Flash is that it's fuck ugly.

If it wasn't for integrating the entirety of C++, this parser would have been done within three days of my starting it. I can't think of a good reason to entirely support Flash even if I got hit in the head hard enough to want to support it.

So basically, Flash and Obj-C are off topic and out of the question.


And, as I said in advance -- as in, before you could even bring it up again -- I don't care how trivial the mistake is. Point is, it was made. You'd have to put out some really, really nasty code to ever make my parser put a semicolon where it doesn't belong, and that's the only mistake it's capable of making.

Where can I see my parser's outcome not being entirely predictable for all audiences?
Code: [Select]
a = b++ c

That's it. That's the only real ambiguity.

And idiotic mistakes in mine... heh. If it works once, it won't just stop working. Due to everything being done procedurally; each step makes assumptions about how the code is formatted based on the imminent success of the previous.

On the subject of which, I think you missed something from the first thing I said.
Code: [Select]
a=0b=1c=2became
Code: [Select]
a=0;b=1c=2
Note that it only added the semicolon once. The second time, nope. Just keeps ignoring further semicolons. Another simple fix, I say! Look how simple this is to get working if you stare at it with a magnifying glass for hours on end, testing everything over again each time to make sure two definitions in your grammar file aren't conflicting; making sure that no order of statements is going to cause total collapse.


I'll give you this, though: I did forget one thing last time. I totally forgot about allowing .7 as well as 0.7, which led to a semicolon being added. In retrospect, I should have been replacing .0 with 00 anyway, because 0.n is valid. Either way, boy did I feel stupid. That said, never did I manage to leave out half the goddamn code just because of "a single if statement that I left out".


Anyway, feel free to continue working on this. If it beats mine in a benchmark, -- that other silly little factor that just doesn't apply to something so great as a token tree -- I'll be happy to use it. And I know it doesn't matter how fast to you, but let me tell you, it certainly matters to me. Why?

I wish you luck with the C++ part of that. You'll need it. I've heard horror stories from "professionals" who were about as happy to hear of my parser's methodology as you, who tried it with the almighty token tree parser and had a record breaking parser time of ten seconds for an STL header. It'll be pretty hard for me to manage a time like that in one pass.

2570
General ENIGMA / Re: Enigma IDE (written in C++ using GTK+)
« on: August 21, 2009, 07:35:52 PM »
I was honestly bound to make that typo at some point.

Also, I don't see a problem with LGM, but I'm all for an official IDE. The hard part will be the room and path editors, I imagine.

2571
General ENIGMA / Re: Generated Parser
« on: August 20, 2009, 08:57:31 PM »
I stopped when a=0b=1c=2 came out as a=0;b=1c=2.

With my parser, if one works, they all work.

And on for() not being implemented in your parser... When I was done with if(), for() was added just by saying this:
Code: [Select]
      sy_semi=sy_semi->push(')','s');
      sy_semi=sy_semi->push(';','s');
      sy_semi=sy_semi->push(';','s');

Where if() was done with only the first line of that code.

The only thing you've demonstrated is how something can work once, and then just stop. The only way mine would ever stop working is if I managed to dis-align the code string and token string. But only a real twat would do that.

I'm writing the parser to do exactly what I want done. It's more focused than your all-purpose tree, and works independent of context. The context parsers are the syntax checker and the C Parser. Neither of those need to use a lexer; they are reading the code just as you and I do and writing things down as they go.

It would be aggravating, but doable, to write this GML-C++ formatter parser without doing any sort of lexing. But doing it this way is so much more general...

The only thing I'm not supporting from the entire C++ language with each of these parsers is literal suffixes, as in 100000000L or 0.5f. I find they're unnecessary anyway.

Even things like struct a {}; struct a b; rather than just a b; can all be so greatly generalized this way.

A tree just sounds like a great way to over-complicate things.


Edit:
I just tried a simple if statement to see if that'd work, and I noticed that it replaced = with ==. Which was nice, until I noticed it also deletes else{} statements. I'll drop dead before something that stupid happens in my parser.
And I don't care how easy it is to fix the goddamn thing. Point is, you somehow managed to delete else(), and that's not the first or only thing you can really fuck up with a token tree.

2572
General ENIGMA / Re: Enigma IDE (written in C++ using GTK+)
« on: August 20, 2009, 07:52:38 AM »
I can see it feeling that way because everything that goes on between LGM and ENIGMA is basically a hack.
We tried making ENIMGA a DLL for LGM to load, but that didn't go so well.
You and I probably need to try that again, Ism.

It also doesn't help that Java performs differently on every computer I've tried it on.

2573
Announcements / Re: Fresh Topic
« on: August 19, 2009, 10:04:23 PM »
Progress:
I'm signed up for all the courses I wanted at the local university.
Also, beginning the cross-communication cycle between the parsers. (Or pieces of parsers).

Storm's abrewin'.

Nothing earth shattering, sorry. Mostly boring things; making sure hex works. Trying nested statements and parentheses. Making sure none of the optimizations and generalizations I make kill anything.

Good thing I like parsers. ^_^

2574
Announcements / Re: Fresh Topic
« on: August 18, 2009, 07:17:29 AM »
I know a certain someone with a top hat fetish.

He also starves bugs to death, though.

Progress:
Parser does everything it used to. Only better, of course, due to extensibility via the CFile parser.
Oh, which means I lied. The parser doesn't support var yet, because var's not a primitive. I'm considering treating it like one, though, so var.whatever never accesses a member of var. Perhaps instead I'll make var's members private, and make sure the parser doesn't abduct any private members.

If you people need something to argue about, I can probably give one of my old controversies that making ENIGMA totally support C has solved... heheheh.

2575
Announcements / Re: Fresh Topic
« on: August 17, 2009, 03:33:36 PM »
Fedora isn't working at all, and the more I fight with it the angrier it gets.

I'm gonna reinstall the OS at some point, but for now I'm just working on the parser.

2576
General ENIGMA / Re: Mixed arrays
« on: August 13, 2009, 04:55:53 PM »
it's supported that since R1... Though it's been recoded multiple times since, so.

2577
Announcements / Re: Fresh Topic
« on: August 13, 2009, 04:51:32 PM »
Tried it? I had to blacklist it. Which fixed nothing. Prime.

2578
Announcements / Re: Fresh Topic
« on: August 13, 2009, 02:11:51 PM »
There's no correct driver
*starts mumbling about the apocalypse*

2579
Announcements / Re: Subject
« on: August 13, 2009, 02:10:26 PM »
We'll compromise and make the 1.3 release 32 days after the 1.2 release.

2580
Announcements / Re: Fresh Topic
« on: August 13, 2009, 08:38:06 AM »
zack--
It has four different pieces, which are basically separate parsers.
One's a syntax checker. It was about 30x faster than Mark's last time I banchmarked.
One's a CFile parser. Its job is to extract definitions from snippets of syntactically correct C++.
One's an expression evaluator. It does mathematics and things that need preprocessed.
This final one is a formatter. Its job is to make sure that even the nastiest GML you can pass it will be syntactically correct C++. It's by far the easiest to write.

Retro--
You can't use '0' to indicate 0x30, if that's what you mean. I might have a setting for that.

Score_under--
That's almost what I did. Instead of allocating things at the beginning, I just write stuff to the same buffer I'm using, since it's 90% removal. When I do that, I'll check if I'm at a begin or end and if so, only write a { or }.


Progress--
I've been trying to get Fedora to work with my graphics card. However, this has only landed me in hell, and I spent basically all of yesterday just trying to fix it. I'll probably give up and either use Ubuntu on my mom's computer, or fix Fedora to work without my card again. -_-