What is a good strategy for reading XML-like hierarchical text data?

I want to read data in a format like the following using Java.

     name=_"My First Scenario."
         user_team_name= _ "My Team"
         name= _ "My Leader's Name"
         type="Elvish Ranger"
         recruit="Elvish Fighter, Elvish Archer, Elvish Shaman"
         user_team_name= _ "Bad Guys"
         name= _ "My Villain"
         type= "Orcish Warrior"
         recruit="Orcish Grunt, Orcish Archer, Orcish Assassin, Wolf Rider"

I want to develop an API that would read such content in a generic manner, for example, having methods like getChildren, getAttributes. I’m wondering if there are libraries that support this kind of task.

Here are the options I have come up with:

  • Since this is a simple language of its own (like XML), should I use a library like ANTLR? Or is that too complex for this task?

  • Should I use regex for parsing this data?

  • Should I process the text manually as a stream, and identify the tags/attributes as they arrive?

  • Or is there a better/different way than all above?

For anyone who is interested, this markup language is used in a game called Battle for Wesnoth, which is written in C++. I want to parse this data using Java.

The game is open source software, so you should download their source code, locate their code that parses it, and either port it from C++ to Java, or add a C++ component to your project in some way. The latter would probably be preferable, because it would enable you to easily incorporate updates from the game if they change or extend their markup language.

This will most likely be much easier than trying to write your own parser. Based on their docs, this looks like a very complex custom markup language.

Also, it would probably be worthwhile to contact the game’s development team and get some feedback about your idea. It may be something they are interested in, or they may at least have some advice about how to proceed.

Update: you probably can’t avoid parsing the whole syntax just because you are doing something “simple”.

Suppose you just want to capture all cases of type=[value]. It seems simple, right? Unfortunately, no. The user can define a macro like so:



This is the equivalent of:


So even the simplest capture will need to understand the whole syntax, if you want it to be fully correct.


I’d just write my own parser, considering there are essentially only two (three) possible cases, identified by the first non-whitespace character in a line:

  • If it’s a bracket, either create a new child node and make it current (opening tag) or go back to the parent (closing tag).
  • If it’s a letter, read the attribute for the current node.

Also, use only plain text parsing. Regular expressions would add overhead for very little gain, IMO.
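
To make that concrete, here is a minimal sketch of such a line-by-line parser in Java. The class names `WmlNode` and `SimpleWmlParser` are made up for illustration. The sketch assumes one tag or one attribute per line, tags written as `[name]` ... `[/name]`, lines starting with `#` being comments or directives that can be skipped, and quoted values that never span several lines. It simply drops the `_` translation marker, and it does not expand macros, so it is not a complete parser for the game's format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One [tag] ... [/tag] block: its name, attributes, and child blocks. */
final class WmlNode {
    private final String name;
    private final WmlNode parent;
    private final Map<String, String> attributes = new LinkedHashMap<>();
    private final List<WmlNode> children = new ArrayList<>();

    WmlNode(String name, WmlNode parent) {
        this.name = name;
        this.parent = parent;
    }

    String getName() { return name; }
    WmlNode getParent() { return parent; }
    Map<String, String> getAttributes() { return attributes; }
    List<WmlNode> getChildren() { return children; }
}

/** Hypothetical helper; not part of any existing library. */
final class SimpleWmlParser {
    private SimpleWmlParser() {}

    /** Parses the text line by line; returns an unnamed root holding the top-level tags. */
    static WmlNode parse(String text) {
        WmlNode root = new WmlNode("", null);
        WmlNode current = root;
        for (String rawLine : text.split("\n")) {
            String line = rawLine.trim(); // also removes a trailing '\r'
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line, or a line assumed to be a comment/directive
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                if (line.startsWith("[/")) {
                    // Closing tag: must match the open node, then go back to the parent.
                    String tag = line.substring(2, line.length() - 1).trim();
                    if (current == root || !current.getName().equals(tag)) {
                        throw new IllegalArgumentException("Unexpected closing tag: " + line);
                    }
                    current = current.getParent();
                } else {
                    // Opening tag: create a child and make it the current node.
                    String tag = line.substring(1, line.length() - 1).trim();
                    WmlNode child = new WmlNode(tag, current);
                    current.getChildren().add(child);
                    current = child;
                }
                continue;
            }
            // Anything else is treated as key=value for the current node.
            int eq = line.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Not a tag or attribute: " + line);
            }
            String key = line.substring(0, eq).trim();
            current.getAttributes().put(key, unquote(line.substring(eq + 1).trim()));
        }
        if (current != root) {
            throw new IllegalArgumentException("Unclosed tag: [" + current.getName() + "]");
        }
        return root;
    }

    /** Drops a leading translation marker (_) and surrounding double quotes, if present. */
    private static String unquote(String value) {
        if (value.startsWith("_") && value.substring(1).trim().startsWith("\"")) {
            value = value.substring(1).trim();
        }
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return value;
    }
}
```

With something like this, `SimpleWmlParser.parse(text).getChildren()` gives you the top-level tags, and `getAttributes().get("type")` on a node gives that node's `type` value. Keeping a parent reference in each node is what makes "go back to the parent" a one-liner. If you need to know which strings were marked translatable, you would store a flag next to the value instead of discarding the marker.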


This format is similar to YAML. Consider converting to YAML using a regular expression and using a standard YAML parser. You will need to encode translatable text denoted by an underscore and IDs (non-quoted text literals).

Learn YAML
