Last modified: 02.02.2007 02:56
Lecture №10: Computational Lexicography: Concordances
Twelfth Session: 23.01.2007
Summary of the session:
This session is dedicated to one question: how do we find the lexical information that is to be included in the lexical entries, as the microstructure and macrostructure of a dictionary? The answer: by means of concordances (words in context). Lexicographers look at different texts (fictional and non-fictional; books, newspapers, magazines, texts on the Internet).
REVIEW OF LEXICOGRAPHY PRINCIPLES
Criteria for good lexicography:
- Quantity – completeness of coverage: extensional coverage (number of entries) and intensional coverage (number of types of lexical information). Question: is the macrostructure big enough?
- Quality – 1. correctness of information: types of lexical information; 2. consistency of structure: macrostructure, microstructure, mesostructure. Question: is the macrostructure broad enough?
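The two quantity measures can be made concrete on a toy lexicon. The sketch below assumes a lexicon stored as a Python dictionary that maps each headword to a dictionary of data categories; the entries, the category names and the function names are invented for illustration and do not come from the lecture.

```python
# Toy lexicon (hypothetical entries): headword -> {data category: value}
lexicon = {
    "wine": {"pos": "noun", "definition": "an alcoholic drink made from grapes"},
    "beer": {"pos": "noun"},
    "brew": {"pos": "verb", "definition": "to make beer"},
}

def extensional_coverage(lex):
    """Number of entries in the lexicon."""
    return len(lex)

def intensional_coverage(lex):
    """Number of distinct types of lexical information used across all entries."""
    return len({category for entry in lex.values() for category in entry})

print(extensional_coverage(lexicon))  # 3
print(intensional_coverage(lexicon))  # 2
```

Counting distinct categories over all entries is only one possible reading of intensional coverage; a stricter measure might count only the categories that every entry fills in.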
LEXICOGRAPHIC WORKFLOW CYCLE – there are four stages:
- Data acquisition:
  - recordings
  - text collection
  - concordances
  - dictionaries
  The question here is how to obtain the vocabulary. Different types of language (spoken, written) use different vocabulary, so we need recordings and text collections; from these we make concordances, and from the concordances we make dictionaries.
- Lexicon construction:
  - metadata
  - information – what kind of dictionary we are going to produce: semasiological or onomasiological
  - linguistic analysis
- Access to data – what kind of media:
  - traditional print media
  - hyperlexicon – CD, Internet
  - software with lexicon component – word processing, speech processing
- Lexical evaluation:
  - internal: 1. consistency 2. completeness
  - external: utility for the user
LEXICAL DATA ACQUISITION
From (text) corpus to lexicon:
- Corpus Data:
  - Layer 1 – Primary Data (audio/video recordings): different corpora are available on the web, and some companies maintain their own established corpora
  - Layer 2 – Secondary Data (written data: transcription, annotation, metadata)
- Lexicon – from these texts we construct a lexicon:
  - Layer 1 – Corpus Lexicon (wordlist, concordances): a list of the words in the corpus
  - Layer 2 – Lexicon Matrix or Lexicon Table (entries x data categories, no generalizations): make a table from the information
  - Layer 3 – Lexicon with Selected Generalizations: decide what type of lexicon you want to produce, semasiological or onomasiological
  - Layer 4 – Lexicon with Generalization Hierarchies: the mesostructure is integrated into the dictionary itself, so the dictionary becomes a huge network
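As a rough illustration of the first two lexicon layers, the sketch below derives a wordlist with frequencies from a one-sentence corpus (Layer 1) and arranges it as an entries x data categories table (Layer 2). The category names `frequency` and `length` are placeholders chosen for the example, not categories prescribed by the lecture.

```python
from collections import Counter

corpus = "carrot pulled her aside as a couple of dwarfs approached the door purposefully"

# Layer 1: corpus lexicon, i.e. the wordlist with token counts
wordlist = Counter(corpus.split())

# Layer 2: lexicon matrix, one row per entry and one column per data category
# (the two categories used here are illustrative assumptions)
categories = ["frequency", "length"]
matrix = {word: {"frequency": count, "length": len(word)}
          for word, count in sorted(wordlist.items())}

for word, row in matrix.items():
    print(word, *(row[c] for c in categories), sep="\t")
```

No generalizations are made at this stage: every entry simply carries its own values, which is what distinguishes Layer 2 from Layers 3 and 4.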
FROM CORPUS TO LEXICON. CONCORDANCES
- Concordance – a basic tool used by lexicographers for gathering material from texts. Examples: Biblical concordances, concordances of legal/literary texts
- A KWIC (Key Word In Context) concordance is a special kind of preliminary, corpus-based dictionary: each word in a text corpus is paired with its contexts of occurrence in this corpus. In other words, collect the words from the text corpus plus their contexts. NOTE: a Google results page can be seen as a special kind of KWIC concordance
An example text:
“Carrot pulled her aside as a couple of dwarfs approached the door purposefully.” (Terry Pratchett, Thud!, p. 167)
KWIC in text order, before alphabetical sorting; keywords with right-hand contexts (3-word context):
Carrot pulled her aside
pulled her aside as
her aside as a
aside as a couple
as a couple of
a couple of dwarfs
couple of dwarfs approached
of dwarfs approached the
dwarfs approached the door
approached the door purposefully
the door purposefully
door purposefully
purposefully
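A listing like the one above can be produced mechanically. The following is a minimal sketch, assuming whitespace tokenisation and a hypothetical helper name `right_hand_kwic`; it pairs each token with up to three following tokens.

```python
def right_hand_kwic(text, width=3):
    """Pair every token with up to `width` tokens that follow it."""
    tokens = text.split()
    return [(tok, tokens[i + 1:i + 1 + width]) for i, tok in enumerate(tokens)]

sentence = ("Carrot pulled her aside as a couple of dwarfs "
            "approached the door purposefully")

for keyword, context in right_hand_kwic(sentence):
    print(" ".join([keyword] + context))
```

Sorting the result by keyword, for example with `sorted(rows, key=lambda row: row[0].lower())`, would turn the text-order listing into an alphabetical one.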
CONCORDING ON THE WEB
- the first: HyprLex, VerMobil HyprLex
- some more: General information on concordancing, Corpus Linguistics
A KWIC CONCORDANCE ENGINE (from the Internet)
KWIC concordance construction:
- Corpus creation – create a corpus
- Tokenisation – the corpus should be tokenised, i.e. split into tokens (units, here individual words):
  carrot pulled her aside as a couple of dwarfs approached the door purposefully
- Keyword list extraction – collect all the words in a list
- Context collation – collect the context we are interested in (preceding or following, a whole sentence, a paragraph, a text, only 3-4 words, etc.)
- Keyword search
- Output formatting
SIMPLEST KWIC PROCEDURE
- Corpus creation – make a corpus of texts in electronic format
- Tokenisation (pre-process each text) – 1. Process punctuation marks 2. Break the text into context units (lines/sentences)
- Keyword list extraction (all words in text) – sort the list alphabetically and then remove all duplicate words
- Context collation (for each keyword)
- Search for KWIC in corpus
- Store output and format – for printing, hypertext (CD, web)
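The procedure above can be sketched end to end in a few lines. This is one possible reading, not the engine used in the lecture: it assumes that each text is a single context unit, that tokens are runs of the letters a-z, and that the concordance is a dictionary from keyword to the units containing it. The name `build_concordance` is hypothetical.

```python
import re
from collections import defaultdict

def build_concordance(texts):
    """Map each keyword to the list of context units (here: whole texts) containing it."""
    concordance = defaultdict(list)
    for text in texts:
        # Tokenisation: lower-case the text and keep only runs of letters
        tokens = re.findall(r"[a-z]+", text.lower())
        unit = " ".join(tokens)
        # Keyword list extraction and context collation in one pass
        for keyword in set(tokens):
            concordance[keyword].append(unit)
    # Alphabetical order of keywords for the final listing
    return dict(sorted(concordance.items()))

corpus = [
    "Carrot pulled her aside as a couple of dwarfs approached the door purposefully.",
    "He knew Nanny Ogg very well, but mainly as the person standing just behind "
    "Granny Weatherwax and smiling a lot.",
]

concordance = build_concordance(corpus)
print(len(concordance["as"]))  # 2: "as" occurs in both context units
```

Keeping the whole unit as context is the simplest choice; narrower windows of a few words only change the collation step.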
COMPUTING A KWIC CONCORDANCE
Quizzes:
1. What is a KWIC concordance?
A KWIC concordance is a special kind of preliminary, corpus-based dictionary: each word in a text corpus is paired with its contexts of occurrence in this corpus.
2. Which are the two main components of lexicon construction based on empirical data?
3. Which layers of abstraction are involved in corpus acquisition?
4. Which layers of abstraction are involved in lexicon construction? Describe them.
5. Which layer do standard dictionary types typically belong to?
6. What are the 6 main steps in KWIC concordance construction? Explain each of them!
- Corpus creation: make a corpus of different texts. For example:
In Ankh-Morpork, greatest of its cities, spring was nudged aside by summer, and summer was prodded in the back by autumn. Terry Pratchett, Feet of Clay
He knew Nanny Ogg very well, but mainly as the person standing just behind Granny Weatherwax and smiling a lot. Terry Pratchett, Carpe Jugulum
Carrot pulled her aside as a couple of dwarfs approached the door purposefully. Terry Pratchett, Thud!
- Tokenisation: pre-process each text – 1. remove all punctuation marks 2. convert upper-case letters to lower case 3. break the text into context units (lines, sentences):
he knew nanny ogg very well but mainly as the person standing just behind granny weatherwax and smiling a lot
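The tokenisation step can be sketched as follows, assuming ASCII punctuation and whitespace-separated words; the function name `tokenise` is chosen for the example.

```python
import string

def tokenise(text):
    """Strip punctuation, lower-case the text and split it into word tokens."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower().split()

raw = ("He knew Nanny Ogg very well, but mainly as the person standing "
       "just behind Granny Weatherwax and smiling a lot.")

print(" ".join(tokenise(raw)))
```

Deleting every punctuation mark is a blunt instrument: a hyphenated name such as Ankh-Morpork would come out as the single token "ankhmorpork", so a real tokeniser would need rules for hyphens and apostrophes.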
- Keyword list extraction: first, create a list from the tokenised words; then, order the words alphabetically; and finally, remove the duplicate words:
Keyword list:
he
knew
nanny
ogg
very
well
but
mainly
as
the
person
standing
just
behind
granny
weatherwax
and
smiling
a
lot
Keyword list ordered alphabetically without any duplicate words:
a
and
as
behind
but
granny
he
just
knew
lot
mainly
nanny
ogg
person
smiling
standing
the
very
weatherwax
well
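In code, the two operations of sorting and removing duplicates collapse into one expression. A minimal sketch, assuming the tokenised sentence from the example:

```python
tokens = ("he knew nanny ogg very well but mainly as the person standing "
          "just behind granny weatherwax and smiling a lot").split()

# A set drops duplicates; sorted() then orders the survivors alphabetically.
keywords = sorted(set(tokens))

print(keywords)
```

In plain character order "weatherwax" sorts before "well", because "a" precedes "l" at the third letter.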
- Context collation: take each context unit and mark its beginning and end with #; then pair each word from the keyword list with its contexts:
he knew nanny ogg very well but mainly as the person standing just behind granny weatherwax and smiling a lot
# he knew
he knew nanny
knew nanny ogg
nanny ogg very
ogg very well
very well but
well but mainly
but mainly as
mainly as the
as the person
the person standing
person standing just
standing just behind
just behind granny
behind granny weatherwax
granny weatherwax and
weatherwax and smiling
and smiling a
smiling a lot
a lot #
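The three-word windows above can be generated by padding the token list with the boundary mark and reading off each token's neighbours. A minimal sketch, with `collate_contexts` as a hypothetical helper name:

```python
def collate_contexts(tokens, boundary="#"):
    """Return (keyword, left neighbour, right neighbour) for every token position."""
    padded = [boundary] + tokens + [boundary]
    return [(padded[i], padded[i - 1], padded[i + 1])
            for i in range(1, len(padded) - 1)]

tokens = ("he knew nanny ogg very well but mainly as the person standing "
          "just behind granny weatherwax and smiling a lot").split()

for keyword, left, right in collate_contexts(tokens):
    print(left, keyword, right)
```

A wider context only needs a larger slice around position `i` and correspondingly more boundary marks on each side.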
- Keyword search: for example, search for "just" – it is found in the middle of the following context unit: "standing just behind"
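The search step can be sketched as a filter over the same padded token list, assuming one word of context on each side; the function name `search` is illustrative.

```python
def search(keyword, tokens, boundary="#"):
    """Return every three-token context whose middle token is `keyword`."""
    padded = [boundary] + tokens + [boundary]
    return [" ".join(padded[i - 1:i + 2])
            for i in range(1, len(padded) - 1)
            if padded[i] == keyword]

tokens = ("he knew nanny ogg very well but mainly as the person standing "
          "just behind granny weatherwax and smiling a lot").split()

print(search("just", tokens))    # ['standing just behind']
print(search("dwarfs", tokens))  # []
```

A keyword that occurs several times yields one context per occurrence, and a keyword that is absent yields an empty list.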
- Output formatting – store output and format:
Tabulated table (keyword, then its context):
a           smiling a lot
and         weatherwax and smiling
as          mainly as the
behind      just behind granny
but         well but mainly
granny      behind granny weatherwax
he          # he knew
just        standing just behind
knew        he knew nanny
lot         a lot #
mainly      but mainly as
nanny       knew nanny ogg
ogg         nanny ogg very
person      the person standing
smiling     and smiling a
standing    person standing just
the         as the person
very        ogg very well
weatherwax  granny weatherwax and
well        very well but
Normal table:
word | preceding context | KWIC | following context
a | smiling | a | lot
and | weatherwax | and | smiling
as | mainly | as | the
behind | just | behind | granny
but | well | but | mainly
granny | behind | granny | weatherwax
he | # | he | knew
just | standing | just | behind
knew | he | knew | nanny
lot | a | lot | #
mainly | but | mainly | as
nanny | knew | nanny | ogg
ogg | nanny | ogg | very
person | the | person | standing
smiling | and | smiling | a
standing | person | standing | just
the | as | the | person
very | ogg | very | well
weatherwax | granny | weatherwax | and
well | very | well | but
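A table of this shape can be rendered from (keyword, preceding, following) triples. The sketch below is one way to do it, assuming a pipe-delimited plain-text layout with padded columns; `format_table` is a hypothetical name and only three sample rows are shown.

```python
def format_table(rows):
    """Render (keyword, left, right) rows as a pipe-delimited table sorted by keyword."""
    header = ("word", "preceding context", "KWIC", "following context")
    body = [(kw, left, kw, right) for kw, left, right in sorted(rows)]
    lines = [header] + body
    # Column width = longest cell in that column
    widths = [max(len(line[col]) for line in lines) for col in range(4)]
    return "\n".join(
        " | ".join(cell.ljust(width) for cell, width in zip(line, widths)).rstrip()
        for line in lines
    )

rows = [("he", "#", "knew"), ("knew", "he", "nanny"), ("a", "smiling", "lot")]
print(format_table(rows))
```

The same rows could be written as HTML table cells for a hypertext version without changing the collation code.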
Homework:
The text for analysis:
Ginger beer
Fermentation has been used by mankind for thousands of years for raising bread, fermenting wine and brewing beer. The products of the fermentation of sugar by baker's yeast Saccharomyces cerevisiae (a fungus) are ethyl alcohol and carbon dioxide. Carbon dioxide causes bread to rise and gives effervescent drinks their bubbles. This action of yeast on sugar is used to "carbonate" beverages, as in the addition of bubbles to champagne.
Task:
Discuss:
semantic components – the components of fermentation are raising bread, fermenting wine and brewing beer; the components of the fermentation of sugar by baker's yeast are ethyl alcohol and carbon dioxide;
semantic components of the SDD (standard dictionary definition) for words from the text:
wine – an alcoholic drink made from grapes, or a type of this drink: e.g. a glass of wine; e.g. a delicious Californian wine; hyponyms: red/white wine, dry/sweet/sparkling wine
components:
wine is definiendum
drink is genus proximum
alcoholic...made from grapes – differentia specifica
the examples are a contextual component of the definition or else the contextual/syntagmatic definition
the hyponyms red wine, white wine, dry wine, sweet wine, sparkling wine are paradigmatic components of the definition or they are paradigmatic definitions
semantic fields – beverages: wine, beer, ethyl alcohol, champagne; products of fermentation: bread, wine, beer, ethyl alcohol, carbon dioxide;
semantic relations:
sugar-salt: co-hyponyms, antonyms
wine-champagne-beer: co-hyponyms, antonyms
drink-beverage: co-hyponyms, synonyms
beverages-beer, wine, champagne: a hyperonym (superordinate term) and 3 hyponyms (subordinate terms)
champagne-bubbles: meronyms (bubbles appear to be parts of the champagne according to the text)
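The hyperonym and co-hyponym relations listed above can be derived from a single mapping. The sketch below assumes a dictionary from each hyponym to its hyperonym, filled in with the beverage terms from the homework text; the function names are illustrative.

```python
# hyponym -> hyperonym, as read off the homework text
hyperonym_of = {"beer": "beverage", "wine": "beverage", "champagne": "beverage"}

def hyponyms(term):
    """All terms directly subordinate to `term`."""
    return sorted(h for h, parent in hyperonym_of.items() if parent == term)

def co_hyponyms(term):
    """All other terms that share a hyperonym with `term`."""
    parent = hyperonym_of.get(term)
    if parent is None:
        return []
    return [h for h in hyponyms(parent) if h != term]

print(hyponyms("beverage"))  # ['beer', 'champagne', 'wine']
print(co_hyponyms("wine"))   # ['beer', 'champagne']
```

Synonymy, antonymy and meronymy are not recoverable from this mapping and would need relation tables of their own.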
definitions:
champagne – a French white wine with a lot of bubbles, drunk on special occasions; any of various effervescent wines, such as champagne: e.g. In the novels, all the naughty people take champagne and oysters.
SDD - a French white wine with a lot of bubbles, drunk on special occasions
Syntagmatic (contextual definition) - e.g. In the novels, all the naughty people take champagne and oysters.
Paradigmatic definition - any of various effervescent wines, such as champagne (champagne is a hyponym of effervescent wines).
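The components identified in this homework (definiendum, genus proximum, differentia specifica, plus syntagmatic and paradigmatic material) can be held in a small record type. This is only a sketch of one possible microstructure; the class and field names are assumptions, and the values are taken from the wine entry discussed above.

```python
from dataclasses import dataclass, field

@dataclass
class Definition:
    definiendum: str             # the word being defined
    genus_proximum: str          # the nearest superordinate term
    differentia_specifica: str   # what sets it apart within the genus
    syntagmatic: list = field(default_factory=list)   # example contexts
    paradigmatic: list = field(default_factory=list)  # related terms, e.g. hyponyms

    def summary(self):
        """Show the definition as genus plus differentia."""
        return f"{self.definiendum} = {self.genus_proximum} + [{self.differentia_specifica}]"

wine = Definition(
    definiendum="wine",
    genus_proximum="drink",
    differentia_specifica="alcoholic, made from grapes",
    syntagmatic=["a glass of wine", "a delicious Californian wine"],
    paradigmatic=["red wine", "white wine", "dry wine", "sweet wine", "sparkling wine"],
)

print(wine.summary())  # wine = drink + [alcoholic, made from grapes]
```

An entry for champagne would fill the same fields, with "wine" as genus proximum and the novel quotation as its syntagmatic component.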
27.01.2006 http://www.thefreedictionary.com/bubbly