11.11.2006 18:31 - LEARNER'S DIARY: LECTURE 10
Author: pkarpuzov Category: Technology

Last modified: 02.02.2007 02:56


 

Lecture №10: Computational Lexicography: Concordances

Twelfth Session: 23.01.2007


 

Summary of the session:

This session is dedicated to the question of how to find the lexical information that is to be included in the lexical entries, that is, in the microstructure and macrostructure of a dictionary. The answer: by means of concordances (words in context). Lexicographers look at different texts (fictional and non-fictional; books, newspapers, magazines, texts on the Internet).

  • REVIEW OF LEXICOGRAPHY PRINCIPLES

  • Criteria for good lexicography:

    • Quantity – completeness of coverage: extensional coverage (number of entries) and intensional coverage (number of types of lexical information) – is the macrostructure big enough?

    • Quality – 1. correctness of information: types of lexical information; 2. consistency of structure: macrostructure, microstructure, mesostructure – is the macrostructure broad enough?

  • LEXICOGRAPHIC WORKFLOW CYCLE – there are several stages (four):

  1. Data acquisition:

    1. recordings

    2. text collection

    3. concordances

    4. dictionaries

  • how to get the information/the vocabulary: different types of language (spoken, written) use different vocabulary, so we first need recordings and text collections, then we make concordances from them, and only after that do we make dictionaries

  2. Lexicon construction:

    1. metadata

    2. information – which kind of dictionary is to be produced: semasiological or onomasiological

    3. linguistic analysis

  3. Access to data – what kind of media:

    1. traditional print media

    2. hyperlexicon – CD, Internet

    3. software with lexicon component – word processing, speech processing

  4. Lexical evaluation:

    1. internal: 1. consistency 2. completeness

    2. external: utility for the user

  • LEXICAL DATA ACQUISITION

  • From (text) corpus to lexicon:

    • Corpus Data:

      • Layer 1 – Primary Data (audio/video recordings): there are different corpora on the web or different companies have their own established corpora

      • Layer 2 – Secondary Data (written data – transcription, annotation, metadata)

    • Lexicon – from these texts we construct a lexicon:

      • Layer 1 – Corpus Lexicon (wordlist, concordances): a list of the words in the corpus

      • Layer 2 – Lexicon Matrix or Lexicon Table (entries x data categories, no generalizations): make a table from the information

      • Layer 3 – Lexicon with Selected Generalizations: decide what type of lexicon you want to produce, semasiological or onomasiological

      • Layer 4 – Lexicon with Generalization Hierarchies: integration of mesostructure in the dictionary itself; so, the dictionary becomes a huge network

  • FROM CORPUS TO LEXICON. CONCORDANCES

    • Concordance – a basic tool used by lexicographers for gaining material from texts. Examples: Biblical concordances, concordances of legal/literary texts

    • A KWIC (Key Word In Context) concordance is a special kind of preliminary, corpus-based dictionary: each word in a text corpus is paired with its contexts of occurrence in this corpus, i.e. we collect the words from the text corpus plus their contexts. NOTE: a web search engine such as Google can be seen as a special kind of KWIC concordance

An example text:

“Carrot pulled her aside as a couple of dwarfs approached the door purposefully.” (Terry Pratchett, Thud!, p. 167)

KWIC lines in text order (they still have to be sorted alphabetically by keyword); each keyword with its right-hand context of up to three words:

Carrot pulled her aside

pulled her aside as

her aside as a

aside as a couple

as a couple of

a couple of dwarfs

couple of dwarfs approached

of dwarfs approached the

dwarfs approached the door

approached the door purposefully

the door purposefully

door purposefully

purposefully
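The listing above can be reproduced with a few lines of Python. This is only a minimal sketch, not the tool used in the lecture; the function name right_context_kwic and the width parameter are made up for illustration.

```python
def right_context_kwic(text, width=3):
    """Pair every token with up to `width` tokens of right-hand context."""
    tokens = text.split()
    # A slice that runs past the end of the list is simply shorter,
    # which gives the shrinking lines at the end of the sentence.
    return [" ".join(tokens[i:i + width + 1]) for i in range(len(tokens))]


sentence = ("Carrot pulled her aside as a couple of dwarfs "
            "approached the door purposefully")

lines = right_context_kwic(sentence)
for line in lines:
    print(line)

# Sorting the lines case-insensitively orders them by keyword.
alphabetical = sorted(lines, key=str.lower)
```

Here lines keeps the text order shown above, while alphabetical starts with "a couple of dwarfs".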

  • CONCORDING ON THE WEB

    • the first: HyprLex, VerMobil HyprLex

    • some more: General information on concordancing, Corpus Linguistics

  • A KWIC CONCORDANCE ENGINE (from Internet)

  • KWIC concordance construction:

  1. Corpus creation – create corpus

  2. Tokenisation – the corpus is split into tokens (units), here individual words:

    carrot pulled her aside as a couple of dwarfs approached the door purposefully

  3. Keyword list extraction – collect all the words in a list

  4. Context Collation – collect context we are interested in (preceding or following, a whole sentence, a paragraph, a text, only 3-4 words, etc.)

  5. Keyword search

  6. Output Formatting

  • SIMPLEST KWIC PROCEDURE

  1. Corpus creation – make a corpus of texts in electronic format

  2. Tokenisation (pre-process each text) – 1. Process punctuation marks 2. Break the text into context units (lines/sentences)

  3. Keyword list extraction (all words in text) – sort the list alphabetically and then remove all duplicate words

  4. Context collation (for each keyword)

  5. Search for KWIC in corpus

  6. Store output and format – for printing, hypertext (CD, web)

  • COMPUTING A KWIC CONCORDANCE
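The notes do not record how the concordance was actually computed in the session. As a rough sketch of the six steps listed above, and assuming a one-word context on each side, the procedure could look as follows in Python. All function names (tokenise, build_index, search, format_rows) are hypothetical, and the tokeniser is deliberately crude: it keeps only runs of the letters a-z, so hyphenated words and words with apostrophes are split.

```python
import re
from collections import defaultdict


def tokenise(unit):
    """Step 2: lower-case the context unit and keep only runs of letters."""
    return re.findall(r"[a-z]+", unit.lower())


def build_index(corpus, width=1):
    """Steps 3-4: collect every keyword together with its contexts."""
    index = defaultdict(list)
    for unit in corpus:
        # '#' marks the beginning and the end of a context unit.
        tokens = ["#"] + tokenise(unit) + ["#"]
        for i in range(1, len(tokens) - 1):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            index[tokens[i]].append((left, tokens[i], right))
    return index


def search(index, keyword):
    """Step 5: look a keyword up in the collated contexts."""
    return index.get(keyword.lower(), [])


def format_rows(index):
    """Step 6: one tab-separated row per occurrence, keywords sorted."""
    return [f"{kw}\t{left}\t{mid}\t{right}"
            for kw in sorted(index)
            for left, mid, right in index[kw]]


# Step 1: a (very small) corpus in electronic form.
corpus = ["Carrot pulled her aside as a couple of dwarfs "
          "approached the door purposefully."]

index = build_index(corpus)
keywords = sorted(index)              # alphabetical, duplicates merged
print(search(index, "dwarfs"))        # [('of', 'dwarfs', 'approached')]
print("\n".join(format_rows(index)))
```

Using a dictionary keyed by keyword means that duplicate words are merged automatically, so the "remove all duplicate words" step needs no separate pass.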




Quizzes:

1. What is a KWIC concordance?

A KWIC concordance is a special kind of preliminary, corpus-based dictionary: each word in a text corpus is paired with its contexts of occurrence in this corpus.

2. Which are the two main components of lexicon construction based on empirical data?


3. Which layers of abstraction are involved in corpus acquisition?


4. Which layers of abstraction are involved in lexicon construction? Describe them.


5. Which layer do standard dictionary types typically belong to?


6. What are the 6 main steps in KWIC concordance construction? Explain each of them!

  1. Corpus creation: make a corpus of different texts. For example:

    In Ankh-Morpork, greatest of its cities, spring was nudged aside by summer, and summer was prodded in the back by autumn. Terry Pratchett, Feet of Clay

    He knew Nanny Ogg very well, but mainly as the person standing just behind Granny Weatherwax and smiling a lot. Terry Pratchett, Carpe Jugulum

    Carrot pulled her aside as a couple of dwarfs approached the door purposefully. Terry Pratchett, Thud!

  2. Tokenisation: pre-process each text – 1. remove all punctuation marks 2. convert upper-case letters to lower case 3. break the text into context units (lines, sentences):

    he knew nanny ogg very well but mainly as the person standing just behind granny weatherwax and smiling a lot

  3. Keyword list extraction: first, create a list from the tokenised words; then, order the words alphabetically; and finally, remove the duplicate words:

    Keyword list:

    he

    knew

    nanny

    ogg

    very

    well

    but

    mainly

    as

    the

    person

    standing

    just

    behind

    granny

    weatherwax

    and

    smiling

    a

    lot

    Keyword list ordered alphabetically without any duplicate words:

    a

    and

    as

    behind

    but

    granny

    he

    just

    knew

    lot

    mainly

    nanny

    ogg

    person

    smiling

    standing

    the

    very

    weatherwax

    well

  4. Context collation: pick up context units and write # at the beginning and at the end; pair each word from the keyword list with its contexts:

    he knew nanny ogg very well but mainly as the person standing just behind granny weatherwax and smiling a lot

    # he knew

    he knew nanny

    knew nanny ogg

    nanny ogg very

    ogg very well

    very well but

    well but mainly

    but mainly as

    mainly as the

    as the person

    the person standing

    person standing just

    standing just behind

    just behind granny

    behind granny weatherwax

    granny weatherwax and

    weatherwax and smiling

    and smiling a

    smiling a lot

    a lot #

  5. Keyword search: for example, search for "just" – it is found in the middle of the following context unit: "standing just behind"

  6. Output formatting – store output and format:

    Tabulated table (keyword, then its context unit):

    a            smiling a lot
    and          weatherwax and smiling
    as           mainly as the
    behind       just behind granny
    but          well but mainly
    granny       behind granny weatherwax
    he           # he knew
    just         standing just behind
    knew         he knew nanny
    lot          a lot #
    mainly       but mainly as
    nanny        knew nanny ogg
    ogg          nanny ogg very
    person       the person standing
    smiling      and smiling a
    standing     person standing just
    the          as the person
    very         ogg very well
    weatherwax   granny weatherwax and
    well         very well but

     

    Normal table:

    word         preceding context   KWIC         following context
    a            smiling             a            lot
    and          weatherwax          and          smiling
    as           mainly              as           the
    behind       just                behind       granny
    but          well                but          mainly
    granny       behind              granny       weatherwax
    he           #                   he           knew
    just         standing            just         behind
    knew         he                  knew         nanny
    lot          a                   lot          #
    mainly       but                 mainly       as
    nanny        knew                nanny        ogg
    ogg          nanny               ogg          very
    person       the                 person       standing
    smiling      and                 smiling      a
    standing     person              standing     just
    the          as                  the          person
    very         ogg                 very         well
    weatherwax   granny              weatherwax   and
    well         very                well         but
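The worked answer for the Carpe Jugulum sentence can be checked mechanically. The following is a small sketch, assuming that punctuation is removed with Python's string.punctuation table and that the columns are padded to a fixed width of 12 characters; neither choice comes from the lecture.

```python
import string

raw = ("He knew Nanny Ogg very well, but mainly as the person standing "
       "just behind Granny Weatherwax and smiling a lot.")

# Step 2: remove punctuation marks and upper-case letters, then split.
cleaned = raw.translate(str.maketrans("", "", string.punctuation)).lower()
tokens = cleaned.split()

# Step 3: keyword list, ordered alphabetically and without duplicates.
keywords = sorted(set(tokens))

# Step 4: write '#' at the beginning and at the end of the context unit
# and pair each token with its preceding and following neighbour.
padded = ["#"] + tokens + ["#"]
contexts = {}
for i in range(1, len(padded) - 1):
    triple = (padded[i - 1], padded[i], padded[i + 1])
    contexts.setdefault(padded[i], []).append(triple)

# Step 6: one row per keyword occurrence, in four aligned columns.
rows = [f"{kw:<12}{left:<12}{mid:<12}{right}"
        for kw in keywords
        for left, mid, right in contexts[kw]]
print("\n".join(rows))
```

Note that plain string sorting places weatherwax before well, because "a" precedes "l" in the third letter.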








Homework:

The text for analysis:

Ginger beer

Fermentation has been used by mankind for thousands of years for raising bread, fermenting wine and brewing beer. The products of the fermentation of sugar by baker's yeast Saccharomyces cerevisiae (a fungus) are ethyl alcohol and carbon dioxide. Carbon dioxide causes bread to rise and gives effervescent drinks their bubbles. This action of yeast on sugar is used to "carbonate" beverages (as in the addition of bubbles to champagne).

Task:

Discuss:

semantic components – the components of fermentation are raising bread, fermenting wine and brewing beer; the components of the fermentation of sugar by baker's yeast are ethyl alcohol and carbon dioxide;

semantic components of the SDD (standard dictionary definition) for words from the text:

wine – an alcoholic drink made from grapes, or a type of this drink: e.g. a glass of wine; e.g. a delicious Californian wine; hyponyms: red/white wine, dry/sweet/sparkling wine

components:

wine is definiendum

drink is genus proximum

alcoholic...made from grapes – differentia specifica

the examples are a contextual component of the definition or else the contextual/syntagmatic definition

the hyponyms red wine, white wine, dry wine, sweet wine, sparkling wine are paradigmatic components of the definition or they are paradigmatic definitions
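One possible way to store these components in a lexicon entry is a small record type. This is only an illustrative sketch: the class name Definition, its field names and the gloss method are assumptions made here, not part of the homework task.

```python
from dataclasses import dataclass, field


@dataclass
class Definition:
    definiendum: str             # the word being defined
    genus_proximum: str          # the nearest superordinate term
    differentia_specifica: str   # what sets it apart within the genus
    examples: list = field(default_factory=list)   # syntagmatic component
    hyponyms: list = field(default_factory=list)   # paradigmatic component

    def gloss(self):
        """Spell out the definition by genus and differentia."""
        return (f"{self.definiendum}: a {self.genus_proximum} "
                f"that is {self.differentia_specifica}")


wine = Definition(
    definiendum="wine",
    genus_proximum="drink",
    differentia_specifica="alcoholic and made from grapes",
    examples=["a glass of wine", "a delicious Californian wine"],
    hyponyms=["red wine", "white wine", "dry wine",
              "sweet wine", "sparkling wine"],
)
print(wine.gloss())
```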

semantic fields – beverages: wine, beer, ethyl alcohol, champagne; products of fermentation: bread, wine, beer, ethyl alcohol, carbon dioxide;

semantic relations:

sugar-salt: co-hyponyms, antonyms

wine-champagne-beer: co-hyponyms, antonyms

drink-beverage: co-hyponyms, synonyms

beverages-beer, wine, champagne: a hyperonym (superordinate term) and 3 hyponyms (subordinate terms)

champagne-bubbles: meronymy (according to the text, bubbles appear to be a part of champagne, so bubbles is a meronym of champagne)
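Co-hyponymy does not have to be listed pair by pair: it follows from the hyperonym relation. As a small sketch, assuming the relation is stored as a plain dictionary from hyponym to hyperonym (the names hypernym_of and co_hyponyms are made up here):

```python
# Hyponym -> hyperonym, taken from the beverages example above.
hypernym_of = {
    "beer": "beverage",
    "wine": "beverage",
    "champagne": "beverage",
}


def co_hyponyms(word, relation):
    """Return all other words that share the hyperonym of `word`."""
    parent = relation.get(word)
    if parent is None:
        return set()
    return {w for w, p in relation.items() if p == parent and w != word}


print(sorted(co_hyponyms("wine", hypernym_of)))   # ['beer', 'champagne']
```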

definitions:

champagne – a French white wine with a lot of bubbles, drunk on special occasions; any of various effervescent wines, such as champagne: e.g. In the novels, all the naughty people take champagne and oysters.

SDD - a French white wine with a lot of bubbles, drunk on special occasions

Syntagmatic (contextual definition) - e.g. In the novels, all the naughty people take champagne and oysters.

Paradigmatic definition - any of various effervescent wines, such as champagne (champagne is a hyponym of effervescent wines).
Source: http://www.thefreedictionary.com/bubbly (27.01.2006)



