The ENIGMA project has hit a snag recently. Basically, our collective ideologies have become tangled.
It looks something like this:
[image]
Ism and I are both sorting issues with our projects. Ism is dealing with Java's ill-equipped generics, trying to structure LGM to be more extensible for future releases. As far as you need to be concerned, this is so we can add our Definitions and Overworld resources, as well as other potential resources down the road.
As for myself, right now I am dealing with a small issue regarding a dynamic type that implements implicit accessors and setters for generic data, which has presented a number of "how should I"s, but in the big picture, my concern is for the progression of ENIGMA's parser.
One point is certain: I need to rewrite or do serious work on the C parser. ENIGMA is presently ill-equipped to make distinctions between the following sample lines:
(id)
(int)
(show_message)
(100001)
This has led to two kinds of problems. First, some GM6 code is misinterpreted unless extraneous parentheses are added, which is unacceptable to newcomers. Second, some of C++'s more desirable features, such as complicated ternary expressions, are hard to implement. Both come down to the same thing: ENIGMA needs information from the C++ side to tell these cases apart.
The purpose of the C++ parser (be it the current one, a rewrite, or Clang) is to provide ENIGMA's EDL parser with information about available types, functions, and other constructs. For example, ENIGMA would not know what var a; meant if the C++ parser couldn't understand var's header, and it would not understand what show_message was if it could not read function definitions. The parser also needs to be able to resolve complicated types for further error checking. Hence, we need some mechanism of parsing C++ sources accurately enough for ENIGMA to produce correct C++ code.
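To make the dependency concrete, here is a minimal sketch of how the four sample lines above could be told apart once the C++ side reports which names are types and which are functions. Everything in it (ParenKind, SymbolInfo, classify_paren) is a hypothetical name made up for illustration; it is not ENIGMA's actual code, and a real parser would have to deal with scopes, templates, and multi-token expressions.

```cpp
#include <cctype>
#include <set>
#include <string>

// What a lone token inside parentheses turns out to be.
enum class ParenKind { Cast, FunctionRef, Literal, Variable };

// Hypothetical stand-in for what the C++ parser reports to the EDL parser.
struct SymbolInfo {
    std::set<std::string> types;      // e.g. int, var
    std::set<std::string> functions;  // e.g. show_message
};

// Classify the single token found between '(' and ')'.
inline ParenKind classify_paren(const std::string& token, const SymbolInfo& syms) {
    // A leading digit means a numeric literal such as 100001.
    if (!token.empty() && std::isdigit(static_cast<unsigned char>(token[0])))
        return ParenKind::Literal;
    // A known type name means the parentheses form a cast, as in (int).
    if (syms.types.count(token)) return ParenKind::Cast;
    // A known function name, as in (show_message).
    if (syms.functions.count(token)) return ParenKind::FunctionRef;
    // Assumption: anything else is treated as a variable, as in (id).
    return ParenKind::Variable;
}
```

Without the two sets filled in from real headers, every identifier falls through to the last case, which is roughly the situation described above.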
I have narrowed the choice down to what I consider the two best options.
- Rewrite the C++ parser to support the new ISO and to better support the old.
This will mean partitioning ENIGMA's compiler into two projects, one to parse EDL and one to work with C++, and maintaining them both. I would choose to split the two so that any other party interested in parsing C++ could potentially join in, though past experience suggests this is unlikely to happen.
The benefits of doing this would include minimalistic size and features tailored specifically to ENIGMA and similar projects, meaning—if I code it right—it could potentially work faster than Clang due to a lack of need to lex and check code inside functions. The resulting project would also be less than a megabyte in size (ENIGMA's current hand-rolled C++ parser is 300 KB of source).
- Drop the existing C parser altogether and outsource to Clang.
Clang is an LLVM frontend. Certain members of this forum, I'm sure you're aware, blow lots of smoke about LLVM, but in general we would want to avoid it because it will run users somewhere between half a gigabyte and a complete gigabyte of disk space, and MinGW LD/MSys would still be necessary (which is presently the largest component of ENIGMA).
Clang, and its support for LLVM, would bring unprecedented benefit to the project, though no more than its size would lead you to expect. If ENIGMA were compatible with LLVM, then thanks to the huge amount of support behind the LLVM project, ENIGMA would be able to use, export to, and interact with half a dozen other languages, notably JavaScript, Lua, and Python, plus an interpreter for C++. Depending on its speed, that interpreter could mean a native method of doing execute_string(), or could just be something crappy to avoid (we have only just learned of its existence, and I have little faith in the ability of anything to both parse and interpret C++ in a reasonable timeframe). In short, we would be getting not only C++ functions, but Lua and JavaScript functions as well, if desired.
The issue is not only the huge size. Ostensibly, I could invest a few hours each update into isolating the segments of Clang required to simply parse C++ and give me info about it. Therein lies the issue; I am well aware that at this point, Clang sounds like the clear choice, but whether or not I choose Clang, I am left with something to maintain. Getting just libclang, its Clang dependencies, and the necessary headers from the LLVM svn to compile alongside ENIGMA will take work, research, trial, and error, and will cost about 50MB in the SVN. Then any time Clang updates something, I have to try updating my copy without stomping all over the modifications. I can't measure at the moment how messy it will be; only that the process is not streamlined.
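For comparison, the speed argument for option one rests on not lexing or checking code inside functions. A rough sketch of what that skipping could look like is below. The function name skip_body is hypothetical, and the code makes simplifying assumptions: it only knows about comments, string literals, and character literals, so raw string literals, digit separators, and preprocessor tricks would confuse it.

```cpp
#include <cstddef>
#include <string>

// Given src[open] == '{', return the index one past the matching '}'.
// Returns std::string::npos if the precondition fails or the body never closes.
inline std::size_t skip_body(const std::string& src, std::size_t open) {
    if (open >= src.size() || src[open] != '{') return std::string::npos;
    std::size_t depth = 0;
    for (std::size_t i = open; i < src.size(); ++i) {
        const char c = src[i];
        if (c == '/' && i + 1 < src.size() && src[i + 1] == '/') {
            // Line comment: jump to the end of the line.
            i = src.find('\n', i);
            if (i == std::string::npos) return i;
        } else if (c == '/' && i + 1 < src.size() && src[i + 1] == '*') {
            // Block comment: jump to the closing "*/".
            i = src.find("*/", i + 2);
            if (i == std::string::npos) return i;
            ++i;  // land on the '/' of "*/"
        } else if (c == '"' || c == '\'') {
            // String or character literal: scan to the closing quote,
            // stepping over backslash escapes.
            for (++i; i < src.size() && src[i] != c; ++i)
                if (src[i] == '\\') ++i;
            if (i >= src.size()) return std::string::npos;
        } else if (c == '{') {
            ++depth;
        } else if (c == '}') {
            if (--depth == 0) return i + 1;
        }
    }
    return std::string::npos;  // unterminated body
}
```

The parser would record the function's signature, call something like this at the opening brace, and resume at the returned index without ever tokenizing the body.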
That said, I am torn between the two options. I need people to say, "50 MB and a potential shitton of problems that aren't yours, in exchange for four languages? That sounds worth it to me!" Or to say, "for that price, just write your own."
TL;DR version:
| (1) Custom | (2) Clang |
| --- | --- |
| Tiny (less than 1 MB); fast, pointed runtime | 50 MB; parses EVERYTHING, though quickly |
| Gives precisely the needed information, no more, no less | Gives general information that can likely be used to meet all of ENIGMA's purposes. |
| | Supports interfacing with other languages (Lua, Python, JavaScript) at the cost of hundreds of megabytes on top of Clang |
| Likely to be sole maintainer, responsible for all aspects including any potential errors. This would be no different from now. At worst, it could mean a second recode in the future, but ideally I would make the code sufficiently extensible to prevent that this time. | Maintenance involves separating Clang from LLVM as cleanly as possible every time an update is made; any parse errors are not the responsibility of the ENIGMA team, and may or may not be dealt with in a timely manner. Potentially, we'd be facing another MinGW fiasco. (See #13297) |
Additional Q/A:
dazappa: Clang is a "frontend to LLVM." So you would use Clang but not LLVM? And what's the final size decision, 1gb or 50mb?
Josh@Dreamland: Well, clang has LLVM dependencies, so I would be cutting LLVM into little pieces and throwing away the ones I don't need. 50MB is the projected size after I throw away the little pieces.
dazappa: Would you rather maintain 50mb of shit that you don't know, or 1mb of shit you do?
Josh@Dreamland: Good question. Ideally, for Clang, the maintenance would just be updating Clang headers, adding any more pieces of LLVM that become necessary, and making sure it compiles as though the configure script had run. Maintaining a megabyte of C++ parser can be just as difficult, if not more so, as my responsibility extends beyond making sure it simply compiles.
dazappa: Well, you failed to write the C++ parser happily the first time, so you think you'd be able to do it better the second time? Using clang might save you time if you can get it set up and be able to easily update the headers like you think.
Josh@Dreamland: I do think I'd be able to do better the second time. As you can see, the current version succeeds for the most part, with one warning it throws three times. If I code the second version knowing everything I do from the first, and with all that shit in mind, I should be able to get it to play nice.
dazappa: ...Until you realize you have to rewrite it for a 3rd time.
Josh@Dreamland: I'm going to use a system very similar to the recursive descent scheme Rusky talks about. Basically, it would use the body of the current C parser, which invokes a function to handle the token in the context of each statement. Instead of calling this big mash-up switch statement that makes a hundred if checks, the parser would call one function based on the current context and pass it the token, and that function would deal with it appropriately. This means adding a new type of statement to the list would be pretty easy.
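A minimal sketch of that dispatch idea, assuming a table that maps the current context to one handler function. The names here (Context, ParserState, feed, and the two handlers) are hypothetical and only illustrate the shape; the real parser would have many more contexts and would do real work where this one just logs.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// The kind of statement the parser is currently inside.
enum class Context { TopLevel, Typedef };

struct ParserState {
    Context context = Context::TopLevel;
    std::vector<std::string> log;  // records what each handler did with a token
};

using Handler = std::function<void(ParserState&, const std::string&)>;

// One function per context instead of one big switch full of if checks.
inline void handle_top_level(ParserState& st, const std::string& tok) {
    if (tok == "typedef") st.context = Context::Typedef;  // enter a typedef
    else st.log.push_back("top:" + tok);
}

inline void handle_typedef(ParserState& st, const std::string& tok) {
    if (tok == ";") st.context = Context::TopLevel;  // statement finished
    else st.log.push_back("typedef:" + tok);
}

// Supporting a new statement type means adding a handler and one row here.
inline const std::map<Context, Handler>& handler_table() {
    static const std::map<Context, Handler> table = {
        {Context::TopLevel, handle_top_level},
        {Context::Typedef, handle_typedef},
    };
    return table;
}

// The parser body: look up the handler for the current context, pass it the token.
inline void feed(ParserState& st, const std::string& tok) {
    handler_table().at(st.context)(st, tok);
}
```

Feeding it the tokens typedef, int, myint, ; and then x leaves the state back at TopLevel, with the first two names logged by the typedef handler and x logged by the top-level one.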
Go to town.