Is polymorphism appropriate for modeling natural language form (structure), or is there something better?


Let’s take French and Japanese adjectives as simple examples.

French adjectives have gender (masculine or feminine) and number (singular or plural), whereas Japanese adjectives have neither. However, Japanese adjectives belong to particular groups, called the i-group and the na-group. So we could, in theory, model this simply as (Java syntax):

class FrenchAdjective {
    Gender gender; // enum
    Plurality plurality; // enum
    ...
}

class JapaneseAdjective {
    AdjectiveGroup adjectiveGroup; // enum
    ...
}

There may be some shared attributes between all adjectives of any language (though I can’t think of any right now). Does it make sense to extend these classes from a base class Adjective to make them polymorphic, in the event that there is actually some shared behavior between them?

My concern is that as more languages are added, attributes that were common to all existing languages in the implementation may not be shared by the new language. The attribute would then have to be moved into all subclasses and removed from the superclass, going from no duplication to potentially lots of duplication. That is, unless some other subclass of Adjective were introduced, say IndoEuropeanAdjective, to handle it, which could turn to nonsense pretty quickly.

Is there a better method to model natural languages, or is polymorphism the correct approach?

Edit: the purpose of this is not any computational linguistics task. It is an effort to model basic word forms for a multilingual dictionary. It is not meant to be extensible to every language, only to a very small subset of widely spoken languages, e.g. French, Spanish, German, Japanese, Mandarin Chinese, etc.

2

Answer completely revised following the edit to the question.

Linguistic and language processing issues

It would be very challenging to identify common properties and behaviors of adjectives that would fit
all possible human languages:

  • Linguists seem to agree that not all languages use what western
    grammarians and linguists call an “adjective” (example: some Amerindian languages).
  • The adjective’s semantics (“meaning”) can be expressed by other means across languages
    (for example by suffixes in Dutch or German, or by embedding the adjective in a larger noun,
    as in the German noun “Rotkäppchen” (“Little Red Riding Hood”), which fuses the adjective “rot” (red) and the diminutive suffix “-chen” (little) into a single word).

But if you’re working neither on a universal NLP processor nor on a semantic analyzer, but only on some adjective
generator/matcher/dictionary, it is feasible to find a common polymorphic interface for your
adjectives.

Linguistic properties to be aware of

You have already identified some language dependent properties:

  • Gender: masculine and feminine in French; add neuter for German; English has only one form (a single neuter? or masculine = feminine?).
  • Plurality: singular, plural (always relevant, but it doesn’t necessarily change the form of the adjective in some languages).
  • (Structural) adjective group: the i- and na-groups are relevant only for Japanese. This seems structural to the adjective, i.e. it determines how the forms of the adjective are derived in general (it relates to the adjective itself, not to the context to which it has to be adapted). I wonder whether other structural groups could apply to other languages (e.g. participles in French or English are special forms of a verb that are used exactly like an adjective; invariant adjectives are a category of adjectives that don’t change form depending on the context, for example “same”).

You should also consider the following properties:

  • Declension: exists in many languages such as Latin, German, Dutch, Greek, Russian. This relates to the grammatical role of the nominal group/entity to which the adjective is attached (e.g. subject, direct object, genitive, …). Of course, all these roles are also language specific.
  • Position: for example, in French, an adjective for a given gender can have a different form depending on whether it comes before or after the noun it qualifies (e.g. “un bel homme” and “un homme beau” are different forms of the same adjective; “bel” is the masculine singular form used before a noun starting with a vowel sound, and is not to be confused with the feminine form “belle”).
  • Other contextual information: in German, the form of an adjective for the same position, same declension, same gender and same plurality can depend on whether the nominal group contains a definite article or not (e.g. “das nette Kind” vs. “ein nettes Kind”).
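To make the last point concrete, here is a minimal C++ sketch of a form table keyed by context. The enum names and the function nettNominativeSingular are hypothetical, and the table only covers the nominative singular of “nett”; a real dictionary would also need case and number in the key.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical enums for this sketch only.
enum class Gender { Masculine, Feminine, Neuter };
enum class Article { Definite, Indefinite };

// Nominative singular forms of German "nett", keyed by (gender, article).
// The same gender yields different forms depending on the article.
std::string nettNominativeSingular(Gender g, Article a) {
    static const std::map<std::pair<Gender, Article>, std::string> forms = {
        {{Gender::Masculine, Article::Definite},   "nette"},   // der nette Mann
        {{Gender::Masculine, Article::Indefinite}, "netter"},  // ein netter Mann
        {{Gender::Feminine,  Article::Definite},   "nette"},   // die nette Frau
        {{Gender::Feminine,  Article::Indefinite}, "nette"},   // eine nette Frau
        {{Gender::Neuter,    Article::Definite},   "nette"},   // das nette Kind
        {{Gender::Neuter,    Article::Indefinite}, "nettes"},  // ein nettes Kind
    };
    auto it = forms.find(std::make_pair(g, a));
    assert(it != forms.end());  // every combination is listed above
    return it->second;
}
```

The point of the sketch is only that the article has to be part of the lookup key: gender and plurality alone do not determine the form.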

How to introduce some polymorphism here?

Polymorphism is about behavior, but so far we have only talked about properties. So to define a polymorphic interface, you first have to think about what you want to do. In the case of a dictionary, I suppose you want to:

  • generate a specific form of the adjective, given a set of properties relevant to the language.
  • generate all the possible forms of an adjective, for all the valid combinations of properties.

The first means that you should factor out of the adjective all the properties related to its context of use, and keep in the adjective only structural properties.

The second means that you need an iterator class that walks through the combinations of context properties that are valid and relevant for the language. The iterator is therefore language dependent.

What you then have is three abstract classes: Adjective, AdjectiveContext, and ContextIterator. Something like:

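#include <string>

// Assumed enum definitions; NotApplicable is the special "n.a." value.
enum class Gender { NotApplicable, Masculine, Feminine, Neuter };
enum class Plurality { NotApplicable, Singular, Plural };

class AdjectiveContext;   // defined below
class ContextIterator;    // defined below

class Adjective {
private:                        // not here: structural groups are too language specific
public:
    virtual ~Adjective() = default;
    virtual std::string generateForm(const AdjectiveContext& c) = 0;  // abstract method
    // abstract, returns the appropriate iterator (either general for the language
    // or specific to the adjective); a pointer because ContextIterator is abstract
    // and cannot be returned by value; ownership is left open in this sketch
    virtual ContextIterator* getIterator() = 0;
};

class AdjectiveContext {
private:
    Gender gender;          // if not used in language, neuter or special value n.a.
    Plurality plurality;    // if not used in language, special value meaning n.a.
                            // declension etc. would be language specific
public:
    virtual ~AdjectiveContext() = default;
    // constructors and getters for the common properties
    // ...
};

class ContextIterator {
public:
    virtual ~ContextIterator() = default;
    virtual AdjectiveContext& first() = 0;
    virtual AdjectiveContext& next() = 0;
    virtual bool isLast() = 0;
};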

This C++ sample code is just an illustration of the principles. Of course, you can adapt it to your favorite language (and fine-tune the C++ to clarify ownership of the generated objects).

You can then build an application by deriving Adjective per language, and possibly specializing it further for some families of adjectives (e.g. the i-group and na-group could indeed be specializations of the JapaneseAdjective class instead of a property). Of course, JapaneseAdjective should covariantly return a polymorphic ContextIterator that is in reality a JapaneseContextIterator, which in turn returns JapaneseAdjectiveContext references.
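As an illustration of such a derivation, here is a self-contained sketch for French. It is not the design above verbatim: it assumes a simplified hasNext()/next() iterator, returns the iterator through std::unique_ptr to settle ownership (which gives up covariant return types), and treats the context as a plain struct. The names frenchContexts, FrenchContextIterator, RegularFrenchAdjective and allForms are hypothetical, and only regular adjectives (feminine adds “e”, plural adds “s”) are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class Gender { Masculine, Feminine };
enum class Plurality { Singular, Plural };

struct AdjectiveContext {
    Gender gender;
    Plurality plurality;
};

// Walks the contexts that are valid for one language.
class ContextIterator {
public:
    virtual ~ContextIterator() = default;
    virtual bool hasNext() const = 0;
    virtual AdjectiveContext next() = 0;
};

class Adjective {
public:
    virtual ~Adjective() = default;
    virtual std::string generateForm(const AdjectiveContext& c) const = 0;
    virtual std::unique_ptr<ContextIterator> getIterator() const = 0;

    // Language-independent behavior built on the two abstract methods.
    std::vector<std::string> allForms() const {
        std::vector<std::string> forms;
        std::unique_ptr<ContextIterator> it = getIterator();
        while (it->hasNext()) forms.push_back(generateForm(it->next()));
        return forms;
    }
};

// French: every gender x plurality combination is a valid context.
const std::vector<AdjectiveContext>& frenchContexts() {
    static const std::vector<AdjectiveContext> all = {
        {Gender::Masculine, Plurality::Singular},
        {Gender::Masculine, Plurality::Plural},
        {Gender::Feminine, Plurality::Singular},
        {Gender::Feminine, Plurality::Plural},
    };
    return all;
}

class FrenchContextIterator : public ContextIterator {
    std::size_t index_ = 0;
public:
    bool hasNext() const override { return index_ < frenchContexts().size(); }
    AdjectiveContext next() override {
        assert(hasNext());  // calling next() past the end is a caller error
        return frenchContexts()[index_++];
    }
};

// Regular adjectives only: irregular ones would need their own subclass or table.
class RegularFrenchAdjective : public Adjective {
    std::string stem_;
public:
    explicit RegularFrenchAdjective(std::string stem) : stem_(std::move(stem)) {}
    std::string generateForm(const AdjectiveContext& c) const override {
        std::string form = stem_;
        if (c.gender == Gender::Feminine) form += "e";
        if (c.plurality == Plurality::Plural) form += "s";
        return form;
    }
    std::unique_ptr<ContextIterator> getIterator() const override {
        return std::unique_ptr<ContextIterator>(new FrenchContextIterator());
    }
};
```

With this sketch, RegularFrenchAdjective("grand").allForms() yields “grand”, “grands”, “grande”, “grandes”. A Japanese subclass would supply its own iterator and context properties while client code keeps calling only generateForm and allForms.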

4

Maybe a tag-based approach would work better than a class hierarchy.

I’m imagining that all your dictionary entries / terms carry around a set of tags, each consisting of a tag type and semantic information. So if you have a specific French adjective such as “grand”, you can tag it as an adjective, have a tag whose semantic value points to the canonical form, give it a gender tag with a semantic value of “masculine”, tag it with the plurality tag “singular”, etc.

I’m not quite sure what kind of common functionality you hope to bundle in common superclasses of FrenchAdjective, GermanAdjective, EnglishAdjective etc., but it seems to me that composition using a tagging system, rather than an is-a class hierarchy, is better suited to the task of extending your dictionary when necessary. For example, you might someday decide that your dictionary should also contain references to synonyms, example phrases etc.; if you have defined a class hierarchy, these things will probably not fit into it.

If you think about it, adding tags which consist of a type and semantic information makes your dictionary a set of triples, each consisting of a subject (the word), a relation (the tag) and an object (the semantic data you want to attach to the word). So basically you’d be building a graph, which is more expressive than a tree-like hierarchy. It is probably also better suited to modeling language, which is full of exceptions to rules and of associations between individual terms that (I suspect) a graph captures much better than a tree.
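A minimal sketch of such a triple store in C++ might look as follows. The names Triple, TagGraph and frenchExample, as well as the relation labels (“pos”, “lemma”, “gender”, “plurality”), are made up for the illustration; a real dictionary would more likely use typed relations or an existing graph or RDF library instead of plain strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One fact about a dictionary entry: (subject, relation, object).
struct Triple {
    std::string subject;
    std::string relation;
    std::string object;
};

class TagGraph {
    std::vector<Triple> triples_;
public:
    void add(const std::string& subject, const std::string& relation,
             const std::string& object) {
        assert(!subject.empty() && !relation.empty());  // a tag needs a word and a type
        triples_.push_back(Triple{subject, relation, object});
    }

    // All objects attached to `subject` through `relation` (linear scan).
    std::vector<std::string> objects(const std::string& subject,
                                     const std::string& relation) const {
        std::vector<std::string> result;
        for (const Triple& t : triples_) {
            if (t.subject == subject && t.relation == relation) {
                result.push_back(t.object);
            }
        }
        return result;
    }
};

// Example data: two forms of the French adjective "grand".
TagGraph frenchExample() {
    TagGraph g;
    g.add("grand", "pos", "adjective");
    g.add("grand", "lemma", "grand");
    g.add("grand", "gender", "masculine");
    g.add("grand", "plurality", "singular");
    g.add("grande", "pos", "adjective");
    g.add("grande", "lemma", "grand");
    g.add("grande", "gender", "feminine");
    g.add("grande", "plurality", "singular");
    return g;
}
```

Adding synonyms or example phrases later only means adding triples with a new relation label; no class has to change. A Japanese entry would simply carry a group tag instead of gender and plurality tags.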
