The Extractor API may be used in many different ways, depending on
the intended application. The API is designed to allow maximum
flexibility for a wide variety of applications. Here are a few
sketches of how you might use it in different kinds of applications.
===============================
One Document, One Set of Stop Words
This is a sketch of how
to use the API to process a single text file. This example assumes
that there is no need to customize the stop words.
    /* initialize */
    call ExtrCreateStopMemory();
    call ExtrCreateDocumentMemory();

    /* process text file */
    open the text file;
    while the end of the text file has not yet been reached {
        read a block of the text file into a buffer;
        call ExtrReadDocumentBuffer();
    }
    call ExtrSignalDocumentEnd();
    close the text file;

    /* print out keyphrases */
    call ExtrGetPhraseListSize();
    for i = 0 to (PhraseListSize - 1) do {
        call ExtrGetPhraseByIndex();
        display the i-th keyphrase to the user;
    }

    /* free memory */
    call ExtrClearStopMemory();
    call ExtrClearDocumentMemory();
Note that Extractor
does not manage the text buffer. Extractor reads the text buffer,
but does not change the state of the text buffer in any way. The
text buffer must be allocated and freed outside of Extractor.
This sketch is
essentially what is implemented in the API test wrapper, test_api.c.
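The buffer ownership rule described above can be illustrated in C. This
is a minimal sketch, not code taken from test_api.c. The callback type
BlockConsumer and the functions feed_file_in_blocks and count_blocks are
hypothetical names introduced for illustration, and the callback stands
in for ExtrReadDocumentBuffer, whose actual signature is not shown in
this document. The sketch only shows that the calling code allocates the
buffer, fills it block by block, and frees it.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical callback type. It stands in for a wrapper around
   ExtrReadDocumentBuffer(); the real signature is not assumed here.
   A non-zero return value signals an error. */
typedef int (*BlockConsumer)(void *context, const char *block, size_t length);

/* Reads fp in blocks of at most block_size bytes and hands each block
   to consume. Returns the total number of bytes delivered, or -1 on
   failure. The buffer is allocated and freed here, by the caller of
   the consumer, never by the consumer itself. */
long feed_file_in_blocks(FILE *fp, size_t block_size,
                         BlockConsumer consume, void *context)
{
    char *buffer;
    size_t got;
    long total = 0;

    if (fp == NULL || consume == NULL || block_size == 0)
        return -1;

    buffer = (char *)malloc(block_size);   /* caller allocates */
    if (buffer == NULL)
        return -1;

    while ((got = fread(buffer, 1, block_size, fp)) > 0) {
        if (consume(context, buffer, got) != 0) {
            free(buffer);
            return -1;
        }
        total += (long)got;
    }

    if (ferror(fp))
        total = -1;

    free(buffer);                           /* caller frees */
    return total;
}

/* Illustrative consumer: counts the blocks it receives and leaves the
   buffer contents untouched. */
int count_blocks(void *context, const char *block, size_t length)
{
    (void)block;
    (void)length;
    ++*(int *)context;
    return 0;
}
```

In a real application the consumer would forward each block to
ExtrReadDocumentBuffer, and the caller would invoke
ExtrSignalDocumentEnd once the loop has finished.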
===============================
Many Documents, One Set of Stop Words
This is a sketch of how
to use the API to process many documents. This example assumes that
there is no need to customize the stop words.
    /* initialize the stop words */
    call ExtrCreateStopMemory();

    /* process the text files */
    for each document in the list of documents {

        /* initialize the document memory */
        call ExtrCreateDocumentMemory();

        /* process the current document */
        open the text file for the current document;
        while the end of the text file has not yet been reached {
            read a block of the text file into a buffer;
            call ExtrReadDocumentBuffer();
        }
        call ExtrSignalDocumentEnd();
        close the text file for the current document;

        /* print out keyphrases */
        call ExtrGetPhraseListSize();
        for i = 0 to (PhraseListSize - 1) do {
            call ExtrGetPhraseByIndex();
            display the i-th keyphrase to the user;
        }

        /* free the document memory */
        call ExtrClearDocumentMemory();
    }

    /* free stop word memory */
    call ExtrClearStopMemory();
In this example, all of the documents share the same set of stop
words, so the stop word memory is created only once. This is more
efficient than calling ExtrCreateStopMemory inside the "for each
document" loop.
===============================
Many Documents, Many Sets of Stop Words
This is a sketch of how
to use the API to process many documents. In this example, each
document is processed with its own set of stop words.
    /* process the text files */
    for each document in the list of documents {

        /* initialize */
        call ExtrCreateDocumentMemory();
        call ExtrCreateStopMemory();

        /* load custom stop words */
        open the text file for the custom stop words for the current document;
        while the end of the text file has not yet been reached {
            read a stop word from the file;
            call ExtrAddStopWord();
        }
        close the text file for the custom stop words;

        /* process the current document */
        open the text file for the current document;
        while the end of the text file has not yet been reached {
            read a block of the text file into a buffer;
            call ExtrReadDocumentBuffer();
        }
        call ExtrSignalDocumentEnd();
        close the text file for the current document;

        /* print out keyphrases */
        call ExtrGetPhraseListSize();
        for i = 0 to (PhraseListSize - 1) do {
            call ExtrGetPhraseByIndex();
            display the i-th keyphrase to the user;
        }

        /* free memory */
        call ExtrClearDocumentMemory();
        call ExtrClearStopMemory();
    }
If the application is a
server with many different users, then the users could each have
their own personal list of stop words. For example, if the server
processes e-mail, then the users might want their own names to be
stop words.
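The "load custom stop words" step for a per-user list can be sketched
in C as follows. This is only an illustration under stated assumptions:
the file format (one stop word per line) is assumed, and the names
StopWordSink, load_stop_words and count_words are hypothetical. The
callback stands in for a wrapper around ExtrAddStopWord, whose actual
signature is not shown in this document.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type. It stands in for a wrapper around
   ExtrAddStopWord(); the real signature is not assumed here.
   A non-zero return value signals an error. */
typedef int (*StopWordSink)(void *context, const char *word);

/* Reads stop words from fp, assuming one word per line, and passes
   each non-empty line to add_word. Returns the number of words
   delivered, or -1 on failure. Lines longer than 255 characters are
   split, which is a limitation of this sketch. */
int load_stop_words(FILE *fp, StopWordSink add_word, void *context)
{
    char line[256];
    int count = 0;

    if (fp == NULL || add_word == NULL)
        return -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (line[0] == '\0')
            continue;                          /* skip blank lines */
        if (add_word(context, line) != 0)
            return -1;
        count++;
    }

    return ferror(fp) ? -1 : count;
}

/* Illustrative sink: counts the words it receives. */
int count_words(void *context, const char *word)
{
    (void)word;
    ++*(int *)context;
    return 0;
}
```

For the e-mail server example, each user's file would simply contain
that user's own names, one per line, and would be loaded after
ExtrCreateStopMemory and before the document is read.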
===============================
Process a Document in Sections
This is a sketch of how
to use the API to process a large document, one section at a time.
This example assumes that the same stop words are used for all
sections.
This could be useful
for producing an annotated table of contents for a book. Each
section in the book could be annotated by a list of keyphrases,
where the keyphrases are extracted from that section alone.
This could also be
useful for producing an index. Extractor generates a list of three
to thirty keyphrases for each document that it processes (depending
on ExtrSetNumberPhrases). Thirty keyphrases is not enough to make an
index for a book. However, if the book is processed in blocks of
about one to five pages per block, then Extractor will generate up
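to thirty keyphrases for each block. A two-hundred-page book
processed one page per block could then yield up to six thousand
keyphrases (200 blocks of thirty keyphrases each), or up to 1,200 at
five pages per block. This will be more than enough to make a good
index.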
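The arithmetic behind this estimate can be written as a small C
function. This is a sketch for illustration only: the function name
max_index_keyphrases and its parameters are hypothetical and are not
part of the Extractor API. It computes an upper bound, and does not
account for blocks that yield fewer keyphrases than the limit or for
keyphrases that may repeat from one block to the next.

```c
/* Upper bound on the number of keyphrases obtained when a book of
   `pages` pages is processed in blocks of `pages_per_block` pages,
   with at most `phrases_per_block` keyphrases per block.
   Returns 0 if any argument is not positive. */
long max_index_keyphrases(long pages, long pages_per_block,
                          long phrases_per_block)
{
    long blocks;

    if (pages <= 0 || pages_per_block <= 0 || phrases_per_block <= 0)
        return 0;

    /* ceiling division: a final partial block still counts as a block */
    blocks = (pages + pages_per_block - 1) / pages_per_block;

    return blocks * phrases_per_block;
}
```

With thirty keyphrases per block, a two-hundred-page book gives a bound
of 6000 at one page per block and 1200 at five pages per block.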
    /* initialize stop words */
    call ExtrCreateStopMemory();

    /* process document */
    open the text file for the document;

    /* process sections; the loop ends when the last section,
       and so the end of the text file, has been reached */
    for each section of the document {

        /* initialize memory for current section */
        call ExtrCreateDocumentMemory();

        /* process current section */
        while the end of the section has not yet been reached {
            read a block of the current section into a buffer;
            call ExtrReadDocumentBuffer();
        }
        call ExtrSignalDocumentEnd();

        /* print out keyphrases */
        call ExtrGetPhraseListSize();
        for i = 0 to (PhraseListSize - 1) do {
            call ExtrGetPhraseByIndex();
            display the i-th keyphrase to the user;
        }

        /* free memory for current section */
        call ExtrClearDocumentMemory();
    }

    close the text file;

    /* free stop words */
    call ExtrClearStopMemory();
Note that Extractor can
efficiently handle very large documents without requiring the
documents to be split into smaller chunks. Splitting a document into
sections is not necessary to increase the speed or capacity of
Extractor.