Post by punjaporn » 24 Feb 2017 14:02

Hi all,
I am looking for ways of grouping a long list of linguistic and non-linguistic features of words.
For example, three different features could be measured to reveal ‘word difficulty’: (1) the number of letters in a word, (2) the word's coverage in a list of the 1000 most frequently used English words, and (3) the number of senses of a word (polysemy). With more than 200 features, I need to group them in some way so that I can interpret them as groups rather than individually.

Previous literature has investigated constructs such as ‘difficulty’, ‘complexity’, ‘formality’, and ‘sophistication’ through several measures. However, these terms were studied in separate papers, and many of them overlap.

I think I have two options to begin with. First, I can look at each feature and list the terms that it could reflect. The problem is that a single feature can reflect several aspects. So the second option, starting from an initial list of terms and placing the features under them, seems more practical. If you know of any resources that offer a single common categorization of features, or have ideas on how I should choose the umbrella terms, please let me know. Thank you so much.
Re: linguistic and non-linguistic feature classification

Post by sgtowns » 25 Feb 2017 14:38

I have struggled with a similar issue in my PhD research. The terms used in the literature are really confusing because, as you said, they overlap a lot depending on when the paper was written and who wrote it.

Instead of choosing between your two options, 1) features with lists of categories or 2) categories with lists of features, maybe you can make a spreadsheet with features (word coverage, number of letters in a word) as rows and your categories (difficulty, diversity, sophistication) as columns, and then mark which ones fit where. Then you can probably move rows and columns around and start to see patterns in which features tend to group together (I am guessing).
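
If a spreadsheet gets unwieldy with 200+ features, the same feature-by-category matrix could be kept in a short script. This is only a rough sketch: the feature and category names come from this thread, but the marks in `MARKS` are made-up placeholders, and `build_matrix` and `group_by_pattern` are hypothetical helper names, not part of any tool.

```python
# Feature-by-category matrix: features as rows, categories as columns.
# ASSUMPTION: the marks below are illustrative placeholders only, not
# judgments taken from the literature.
CATEGORIES = ["difficulty", "diversity", "sophistication"]

MARKS = {
    "number of letters in a word": {"difficulty"},
    "word coverage in 1000 most frequent words": {"difficulty", "sophistication"},
    "number of senses (polysemy)": {"difficulty"},
}


def build_matrix(marks, categories):
    """Return one row per feature: a tuple of 0/1 flags, one per category."""
    return {
        feature: tuple(int(cat in cats) for cat in categories)
        for feature, cats in marks.items()
    }


def group_by_pattern(matrix):
    """Group features that share exactly the same pattern of marks."""
    groups = {}
    for feature, row in matrix.items():
        groups.setdefault(row, []).append(feature)
    return groups


matrix = build_matrix(MARKS, CATEGORIES)
groups = group_by_pattern(matrix)

# Print each pattern of marked categories with the features that share it.
for row, features in groups.items():
    marked = [cat for cat, flag in zip(CATEGORIES, row) if flag]
    print(", ".join(marked) or "(no category)")
    for feature in features:
        print("   ", feature)
```

Grouping on identical rows is the crudest version of "moving rows around"; with real data you would probably want to look at partial overlaps as well.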

I can't figure out how to format a table here, so I took a picture of what the spreadsheet might look like:
For an idea of which categories to use, Coh-Metrix separates its 108 linguistic features into 11 top-level categories:

1. Descriptive
2. Text Easability Principal Component Scores
3. Referential Cohesion
4. LSA
5. Lexical Diversity
6. Connectives
7. Situation Model
8. Syntactic Complexity
9. Syntactic Pattern Density
10. Word Information
11. Readability

(For more info on the 108 features, see ... dices.html)

Coh-Metrix comes from a psycholinguistics perspective, so I think that it would be better to do the spreadsheet idea above and make your own categories. You might come up with additional / different insights.

I don't know of any articles or books that have a matrix like this. It would basically be a meta-analysis of previous literature and could be a very interesting study and very helpful for others who are researching linguistic features.
Re: linguistic and non-linguistic feature classification

Post by punjaporn » 26 Feb 2017 00:06

Stuart, thank you so much for the very useful spreadsheet idea and the clear examples. The link to the Coh-Metrix indices also gives me good ideas for how I can present my features and their descriptions.

Also, I completely agree with you that a meta-analysis of linguistic features investigated in previous research could be very useful and interesting. So now you have your next research topic! :)
