Spanish (ES) Language Data

Inflectional Morphology Data

The Lexical Resource for Spanish contains all the standard inflectional forms for nouns, verbs, adjectives, prepositions, conjunctions, etc.

Derivational Morphology Data

Contains all the standard derivational forms, including adverbs ending in “-mente” and superlatives.


Extended Morphology Data

Contains the additional forms obtained by extending the inflectional and derivational form lists to cover further morphological phenomena, such as clitic pronouns.
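
As a rough illustration of how clitic-extended forms relate to the base lists, the following Python sketch splits enclitic pronouns off a verb form. The function names, the clitic inventory, the small set of base forms and the accent handling are all assumptions made for this example; they do not describe how the Lexical Resource itself is built.

```python
import unicodedata
from typing import List, Optional, Set, Tuple

# Hypothetical inventory of Spanish enclitic pronouns (assumption for this sketch).
CLITICS = ("me", "te", "se", "nos", "os", "lo", "la", "le", "los", "las", "les")


def strip_acute(text: str) -> str:
    """Remove acute accents only, so that characters such as ñ and ü survive."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def split_enclitics(
    form: str, base_forms: Set[str], max_clitics: int = 2
) -> Optional[Tuple[str, List[str]]]:
    """Return (base form, clitics) if the form can be analysed, else None."""

    def search(stem: str, found: List[str]) -> Optional[Tuple[str, List[str]]]:
        # Accept the stem once at least one clitic has been removed and the
        # accent-stripped remainder is a known base form.
        if found and strip_acute(stem) in base_forms:
            return strip_acute(stem), found
        if len(found) == max_clitics:
            return None
        # Try every clitic that matches the end of the stem, backtracking
        # when a choice leads nowhere (e.g. "os" versus "los").
        for clitic in CLITICS:
            if stem.endswith(clitic) and len(stem) > len(clitic):
                result = search(stem[: -len(clitic)], [clitic] + found)
                if result is not None:
                    return result
        return None

    return search(form.lower(), [])


# Illustrative base forms; a real list would come from the inflectional data.
BASE_FORMS = {"da", "dar", "comprar"}

analysis_one = split_enclitics("comprarlos", BASE_FORMS)
analysis_two = split_enclitics("dámelo", BASE_FORMS)
```

The accent stripping reflects the fact that attaching clitics can add a written accent to the verb (as in “dámelo”). Because it removes every acute accent, base forms that carry their own accent would need separate handling.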

Frequency Indication

Contains the relative frequency of occurrence in Spanish of each word in the above lists.

Each word has been assigned a frequency group, where the frequency group corresponds to a normalized logarithmic scale from 0 to 255. The most frequent word in the corpus has been assigned frequency group 255, and words not appearing in the corpus have been assigned frequency group 0.
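
A minimal sketch of one way such a scale could be computed, assuming raw corpus counts are available. The exact normalization used by the Lexical Resource is not stated here; the `log(count + 1)` formula, the rounding and the function name are illustrative assumptions.

```python
import math


def frequency_group(count: int, max_count: int) -> int:
    """Map a raw corpus count to a group from 0 to 255 on a logarithmic scale.

    count:     occurrences of the word in the corpus
    max_count: occurrences of the most frequent word in the corpus
    """
    if count <= 0:
        return 0  # words that do not appear in the corpus
    if count > max_count:
        raise ValueError("count cannot exceed max_count")
    # Ratio of logarithms: 1.0 for the most frequent word, close to 0 for rare ones.
    scaled = math.log(count + 1) / math.log(max_count + 1)
    # Attested words never fall into group 0, which is reserved for unseen words.
    return max(1, round(255 * scaled))


top_group = frequency_group(1_000_000, 1_000_000)   # 255
unseen_group = frequency_group(0, 1_000_000)        # 0
```

On a scale like this, equal ratios between counts map to roughly equal steps between groups, which spreads out the very large gap between common and rare words.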

 

Complementary Semantic Annotations

 

Named Entities Morphology Data

Contains morphological data for named entities, including person names, places, companies and organizations.

Offensive Language Flag

Contains a per-word flag indicating whether the word might be considered offensive in certain contexts.

Regional Variants

In addition to the lexical data for Spanish, the Lexical Resource also contains the equivalent lexical data for the following dialects:

      • North America: Mexico, USA and Puerto Rico 
      • Central America and Caribbean: Guatemala, Honduras, El Salvador, Nicaragua, Costa Rica, Panama, Cuba and the Dominican Republic
      • Andes: Venezuela, Ecuador, Peru, Bolivia and Colombia 
      • Southern Cone: Argentina, Paraguay, Uruguay and Chile (including “voseo” forms) 

Volume of Language Data

Total number of forms

2,500,000 forms

 

      • Verbs: 2,250,000 forms (90%)
      • Nouns: 120,000 forms (4%)
      • Adjectives: 120,000 forms (4%)
      • Other: 50,000 forms (2%)

Total number of lemmas

60,000 lemmas

Features

Each form is annotated with its lemma (canonical form), part of speech (POS), and morphological attributes (tense, mood, person, number, gender).

Lemma

The canonical form for the inflected word.

POS

Part of Speech such as noun, verb, adjective, etc.

Voice

Not applicable.

Tense

Specifies when the action takes place such as past, present, future, etc.

Aspect

Not applicable.

Mood

Modality of the verb form: indicative, subjunctive, imperative, etc.

Person

Whether the verb or pronoun refers to the first, second or third person.

Number

Grammatical number of the form: singular or plural.

Gender

Grammatical gender of the noun, verb or adjective form: masculine, feminine, neuter, etc.

Case

Not applicable.

Degree

Not applicable.

Definiteness State

Not applicable.

Negative

Not applicable.

Contractions

Not applicable.

Pronominal Clitics

Clitic pronouns are identified and tagged.

Formality

Not applicable.

Frequency

Relative frequency of the form based on a large general-purpose corpus.

Named Entities

Pre-defined entities are tagged as person names, places, organizations, etc.

Offensive

Indicates whether the form might be considered offensive in certain contexts.
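
To make the feature list above concrete, here is a hedged sketch of how one annotated form might be represented in Python. The class name, field names, value labels and the frequency value are hypothetical choices for this example, not the delivery format of the Lexical Resource. Features listed as not applicable for Spanish are left out.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class LexicalForm:
    """One inflected form with the features that apply to Spanish."""

    form: str
    lemma: str                          # canonical form of the inflected word
    pos: str                            # part of speech: noun, verb, adjective, ...
    tense: Optional[str] = None
    mood: Optional[str] = None
    person: Optional[int] = None        # 1, 2 or 3
    number: Optional[str] = None
    gender: Optional[str] = None
    clitics: Tuple[str, ...] = ()       # attached clitic pronouns, if any
    frequency: int = 0                  # frequency group, 0 to 255
    named_entity: Optional[str] = None  # person name, place, organization, ...
    offensive: bool = False


def annotated_features(entry: LexicalForm) -> Dict[str, Any]:
    """Return only the features that carry a value for this form."""
    return {
        name: value
        for name, value in asdict(entry).items()
        if value is not None and value != ()
    }


# "cantábamos": first person plural, imperfect indicative of "cantar".
# The frequency group of 180 is a made-up value for illustration only.
example = LexicalForm(
    form="cantábamos",
    lemma="cantar",
    pos="verb",
    tense="imperfect",
    mood="indicative",
    person=1,
    number="plural",
    frequency=180,
)

features = annotated_features(example)
```

Keeping unset features as `None` rather than omitting the fields lets verbs, nouns and adjectives share one record type, while `annotated_features` hides whatever does not apply to a given form.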
