Author: pkarpuzov
Category: Technology
11.11.2006 18:31 - LEARNER'S DIARY: LECTURE 10

Last modified: 02.02.2007 02:56

Lecture №10: Computational Lexicography: Concordances

Twelfth Session: 23.01.2007

Summary of the session:

This session is dedicated to the question of how to find the lexical information that is to be included in the lexical entries, as the microstructure and macrostructure of a dictionary. The answer is: by means of concordances (words in context). Lexicographers look at different texts (fictional and non-fictional; books, newspapers, magazines, texts on the Internet).

REVIEW OF LEXICOGRAPHY PRINCIPLES
Criteria for good lexicography:
- Quantity – completeness of coverage: extensional coverage (number of entries) and intensional coverage (number of types of lexical information) – is the macrostructure big enough?
- Quality – 1. correctness of information: types of lexical information 2. consistency of structure: macrostructure, microstructure, mesostructure – is the macrostructure broad enough?
LEXICOGRAPHIC WORKFLOW CYCLE – there are four stages:

Data acquisition:
1. recordings
2. text collection
3. concordances
4. dictionaries

The question here is how to get the information (the vocabulary). Different types of language (spoken, written) have different vocabulary, so we need both recordings and text collections; from these we make concordances, and from the concordances we make dictionaries.

Lexicon construction:
1. metadata
2. information – what kind of dictionary we are going to produce: semasiological or onomasiological
3. linguistic analysis
Access to data – what kind of media:
1. traditional print media
2. hyperlexicon – CD, Internet
3. software with lexicon component – word processing, speech processing
Lexical evaluation:
1. internal: 1. consistency 2. completeness
2. external: utility for the user

LEXICAL DATA ACQUISITION

From (text) corpus to lexicon:
- Corpus Data:
  - Layer 1 – Primary Data (audio/video recordings): various corpora are available on the web, and some companies maintain their own established corpora
  - Layer 2 – Secondary Data (written data – transcription, annotation, metadata)
- Lexicon – from these texts we construct a lexicon:
  - Layer 1 – Corpus Lexicon (wordlist, concordances): a list of the words in the corpus
  - Layer 2 – Lexicon Matrix or Lexicon Table (entries x data categories, no generalizations): make a table from the information
  - Layer 3 – Lexicon with Selected Generalizations: decide what type of lexicon you want to produce, semasiological or onomasiological
  - Layer 4 – Lexicon with Generalization Hierarchies: integration of mesostructure in the dictionary itself; so, the dictionary becomes a huge network
FROM CORPUS TO LEXICON. CONCORDANCES
- Concordance – a basic tool used by lexicographers for gathering material from texts. Examples: Biblical concordances, concordances of legal/literary texts
- A KWIC (Key Word In Context) concordance is a special kind of preliminary, corpus-based dictionary: each word in a text corpus is paired with its contexts of occurrence in this corpus, i.e. we collect the words from the text corpus plus their contexts. NOTE: a search engine such as Google can be regarded as a special kind of KWIC concordance, since it shows search terms together with snippets of the surrounding text

An example text:

“Carrot pulled her aside as a couple of dwarfs approached the door purposefully.” (p.167) by Terry Pratchett, Thud!

KWIC list, shown here in order of occurrence in the text (sorting the lines by keyword gives the alphabetically ordered KWIC); keywords with right-hand contexts (3-word context):

Carrot pulled her aside

pulled her aside as

her aside as a

aside as a couple

as a couple of

a couple of dwarfs

couple of dwarfs approached

of dwarfs approached the

dwarfs approached the door

approached the door purposefully

the door purposefully

door purposefully

purposefully
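
A list like the one above could be produced with a few lines of Python. This is only a sketch: the function name `right_context_kwic` and the simple regular-expression tokeniser are my own assumptions, not something given in the lecture.

```python
import re

def right_context_kwic(sentence, width=3):
    """Pair every token with up to `width` tokens that follow it."""
    # Assumed tokeniser: runs of letters (and apostrophes) count as words.
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return [" ".join(tokens[i:i + width + 1]) for i in range(len(tokens))]

sentence = ("Carrot pulled her aside as a couple of dwarfs "
            "approached the door purposefully.")

# Lines in order of occurrence in the text.
lines = right_context_kwic(sentence)
for line in lines:
    print(line)

# Sorting the lines case-insensitively gives the alphabetical ordering.
alphabetical = sorted(lines, key=str.lower)
```

Near the end of the sentence the contexts get shorter, because fewer than three words are left to the right of the keyword.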

CONCORDING ON THE WEB
- the first: HyprLex, VerMobil HyprLex
- some more: General information on concordancing, Corpus Linguistics
A KWIC CONCORDANCE ENGINE (from Internet)
KWIC concordance construction:

1. Corpus creation – create the corpus
2. Tokenisation – the corpus should be tokenised, i.e. split into tokens (units), here individual words:

   carrot pulled her aside as a couple of dwarfs approached the door purposefully
3. Keyword list extraction – collect all the words in a list
4. Context collation – collect the context we are interested in (preceding or following words, a whole sentence, a paragraph, a text, only 3-4 words, etc.)
5. Keyword search
6. Output formatting

SIMPLEST KWIC PROCEDURE

1. Corpus creation – make a corpus of texts in electronic format
2. Tokenisation (re-process each text) – 1. process punctuation marks 2. break the text into context units (lines/sentences)
3. Keyword list extraction (all words in text) – sort the list alphabetically and then remove all duplicate words
4. Context collation (for each keyword)
5. Search for KWIC in corpus
6. Store output and format – for printing, hypertext (CD, web)

COMPUTING A KWIC CONCORDANCE
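
The six steps of the simplest KWIC procedure can be sketched end to end in Python. This is an illustration under my own assumptions, not the engine shown in the lecture: the function name `kwic_concordance`, the two-word context window and the regular-expression tokeniser are all hypothetical choices.

```python
import re
from collections import defaultdict

def kwic_concordance(texts, width=2):
    """Build {keyword: [(left, keyword, right), ...]} from a list of texts."""
    concordance = defaultdict(list)
    for text in texts:
        # Tokenisation: lower-case the text and keep only word forms.
        tokens = re.findall(r"[a-z']+", text.lower())
        # Context collation: record the neighbours of every token.
        for i, token in enumerate(tokens):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            concordance[token].append((left, token, right))
    return concordance

# Corpus creation: a one-sentence corpus in electronic form.
corpus = ["Carrot pulled her aside as a couple of dwarfs "
          "approached the door purposefully."]
concordance = kwic_concordance(corpus)

# Keyword list: the dictionary keys are duplicate-free; sorted() orders them.
# Output formatting: right-align the left context so the keywords line up.
for keyword in sorted(concordance):
    for left, kw, right in concordance[keyword]:
        print(f"{left:>20}  {kw:<14}{right}")
```

Keyword search is then a dictionary lookup: `concordance["door"]` returns every stored context of "door".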

Quizzes:

1. What is a KWIC concordance?

A KWIC concordance is a special kind of preliminary, corpus-based dictionary: each word in a text corpus is paired with its contexts of occurrence in this corpus.

2. Which are the two main components of lexicon construction based on empirical data?

3. Which layers of abstraction are involved in corpus acquisition?

4. Which layers of abstraction are involved in lexicon construction? Describe them.

5. Which layer do standard dictionary types typically belong to?

6. What are the 6 main steps in KWIC concordance construction? Explain each of them!

Corpus creation: make a corpus of different texts. For example:

In Ankh-Morpork, greatest of its cities, spring was nudged aside by summer, and summer was prodded in the back by autumn. Terry Pratchett, Feet of Clay

He knew Nanny Ogg very well, but mainly as the person standing just behind Granny Weatherwax and smiling a lot. Terry Pratchett, Carpe Jugulum

Carrot pulled her aside as a couple of dwarfs approached the door purposefully. Terry Pratchett, Thud!
Tokenisation: re-process each text – 1. remove all punctuation marks 2. convert upper-case letters to lower case 3. break the text into context units (lines, sentences):

he knew nanny ogg very well but mainly as the person standing just behind granny weatherwax and smiling a lot
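
The tokenisation step above can be tried out in Python. This is a minimal sketch using only the standard library; the helper name `tokenise` is my own.

```python
import string

def tokenise(text):
    """Remove punctuation, lower-case the text, and split it on whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower().split()

text = ("He knew Nanny Ogg very well, but mainly as the person standing "
        "just behind Granny Weatherwax and smiling a lot.")
tokens = tokenise(text)
print(" ".join(tokens))
```

Simply deleting punctuation is a crude choice: a hyphenated name such as "Ankh-Morpork" would be fused into one token, so a real tokeniser needs a rule for such cases.
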
Keyword list extraction: first, create a list of the tokenised words; then, order the words alphabetically; and finally, remove the duplicate words:

Keyword list:

he

knew

nanny

ogg

very

well

but

mainly

as

the

person

standing

just

behind

granny

weatherwax

and

smiling

a

lot

Keyword list ordered alphabetically without any duplicate words:

a

and

as

behind

but

granny

he

just

knew

lot

mainly

nanny

ogg

person

smiling

standing

the

very

weatherwax

well
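
In Python the keyword list can be obtained in one line, assuming the tokens are already available as a list. This is a sketch, not the lecture's own tool.

```python
tokens = ("he knew nanny ogg very well but mainly as the person standing "
          "just behind granny weatherwax and smiling a lot").split()

# A set drops duplicate words; sorted() returns them in alphabetical order.
keywords = sorted(set(tokens))
print(keywords)
```

Plain string sorting puts "weatherwax" before "well", because the two words first differ at the third letter and "a" comes before "l".
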
Context collation: take each context unit and mark its beginning and end with #; then pair each word from the keyword list with its contexts (here one word to the left and one word to the right):

he knew nanny ogg very well but mainly as the person standing just behind granny weatherwax and smiling a lot

# he knew

he knew nanny

knew nanny ogg

nanny ogg very

ogg very well

very well but

well but mainly

but mainly as

mainly as the

as the person

the person standing

person standing just

standing just behind

just behind granny

behind granny weatherwax

granny weatherwax and

weatherwax and smiling

and smiling a

smiling a lot

a lot #
Keyword search: for example, search for "just" – it is found in the middle of the following context unit: "standing just behind"
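
The context collation and keyword search steps above can be sketched in Python as follows. The function name `collate_contexts` and the `boundary` parameter are hypothetical; the one-word window on each side follows the example.

```python
def collate_contexts(tokens, boundary="#"):
    """Return a (preceding, keyword, following) triple for every token."""
    # Pad the context unit with boundary markers at both ends.
    padded = [boundary] + tokens + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

tokens = ("he knew nanny ogg very well but mainly as the person standing "
          "just behind granny weatherwax and smiling a lot").split()
contexts = collate_contexts(tokens)

print(contexts[0])   # first triple: ('#', 'he', 'knew')
print(contexts[-1])  # last triple: ('a', 'lot', '#')

# Keyword search: keep the triples whose middle element is the keyword.
hits = [c for c in contexts if c[1] == "just"]
print(hits)
```
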
Output formatting – store output and format:

Tabulated table:

a smiling a lot

and weatherwax and smiling

as mainly as the

behind just behind granny

but well but mainly

granny behind granny weatherwax

he # he knew

just standing just behind

knew he knew nanny

lot a lot #

mainly but mainly as

nanny knew nanny ogg

ogg nanny ogg very

person the person standing

smiling and smiling a

standing person standing just

the as the person

very ogg very well

weatherwax granny weatherwax and

well very well but

Normal table:

word	preceding context	KWIC	following context
a	smiling	a	lot
and	weatherwax	and	smiling
as	mainly	as	the
behind	just	behind	granny
but	well	but	mainly
granny	behind	granny	weatherwax
he	#	he	knew
just	standing	just	behind
knew	he	knew	nanny
lot	a	lot	#
mainly	but	mainly	as
nanny	knew	nanny	ogg
ogg	nanny	ogg	very
person	the	person	standing
smiling	and	smiling	a
standing	person	standing	just
the	as	the	person
very	ogg	very	well
weatherwax	granny	weatherwax	and
well	very	well	but
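
A table of this kind could be written out as tab-separated text with Python's standard `csv` module. This is only a sketch: the three sample rows and the variable names are my own, and a real run would pass in all collated contexts.

```python
import csv
import io

# (preceding, keyword, following) triples; only three sample rows here.
rows = [("#", "he", "knew"), ("he", "knew", "nanny"), ("knew", "nanny", "ogg")]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["word", "preceding context", "KWIC", "following context"])

# Sort the rows by keyword so the table is in alphabetical order.
for left, keyword, right in sorted(rows, key=lambda row: row[1]):
    writer.writerow([keyword, left, keyword, right])

table = buffer.getvalue()
print(table)
```

Writing to a file opened with `open(path, "w", newline="")` instead of `io.StringIO` would store the same table on disk for printing or for conversion to hypertext.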

Homework:

The text for analysis:

Ginger beer

Fermentation has been used by mankind for thousands of years for raising bread, fermenting wine and brewing beer. The products of the fermentation of sugar by baker's yeast Saccharomyces cerevisiae (a fungus) are ethyl alcohol and carbon dioxide. Carbon dioxide causes bread to rise and gives effervescent drinks their bubbles. This action of yeast on sugar is used to "carbonate" beverages, as in the addition of bubbles to champagne.

Task:

Discuss:

semantic components – components of fermentation are raising bread, fermenting wine, brewing beer; components of fermentation of sugar by baker's yeast are ethyl alcohol and carbon dioxide;

semantic components of the SDD (standard dictionary definition) for words from the text:

wine – an alcoholic drink made from grapes, or a type of this drink: e.g. a glass of wine; e.g. a delicious Californian wine; hyponyms: red/white wine, dry/sweet/sparkling wine

components:

wine is definiendum

drink is genus proximum

alcoholic...made from grapes – differentia specifica

the examples are a contextual component of the definition or else the contextual/syntagmatic definition

the hyponyms red wine, white wine, dry wine, sweet wine, sparkling wine are paradigmatic components of the definition or they are paradigmatic definitions

semantic fields – beverages: wine, beer, ethyl alcohol, champagne; products of fermentation: bread, wine, beer, ethyl alcohol, carbon dioxide;

semantic relations:

sugar-salt: co-hyponyms, antonyms

wine-champagne-beer: co-hyponyms, antonyms

drink-beverage: co-hyponyms, synonyms

beverages-beer, wine, champagne: a hyperonym (superordinate term) and 3 hyponyms (subordinate terms)

champagne-bubbles: meronyms (bubbles appear to be parts of the champagne according to the text)

definitions:

champagne – a French white wine with a lot of bubbles, drunk on special occasions; any of various effervescent wines, such as champagne: e.g. In the novels, all the naughty people take champagne and oysters.

SDD - a French white wine with a lot of bubbles, drunk on special occasions

Syntagmatic (contextual definition) - e.g. In the novels, all the naughty people take champagne and oysters.

Paradigmatic definition - any of various effervescent wines, such as champagne (champagne is a hyponym of effervescent wines).

27.01.2006 http://www.thefreedictionary.com/bubbly
