pdftron::PDF::TextExtractor::Word Class Reference

A TextExtractor::Word object represents a word on a PDF page.

#include <TextExtractor.h>

Public Member Functions

int GetNumGlyphs ()
void GetBBox (double out_bbox[4])
void GetQuad (double out_quad[8])
void GetGlyphQuad (int glyph_idx, double out_quad[8])
Style GetCharStyle (int char_idx)
Style GetStyle ()
int GetStringLen ()
const Unicode * GetString ()
Word GetNextWord ()
int GetCurrentNum ()
bool IsValid ()
bool operator== (const Word &)
bool operator!= (const Word &)
 Word ()


Detailed Description

A TextExtractor::Word object represents a word on a PDF page.

Each word contains a sequence of characters in one or more styles (see TextExtractor::Style).


Constructor & Destructor Documentation

pdftron::PDF::TextExtractor::Word::Word (  ) 


Member Function Documentation

int pdftron::PDF::TextExtractor::Word::GetNumGlyphs (  ) 

Returns:
The number of glyphs in this word.

void pdftron::PDF::TextExtractor::Word::GetBBox ( double  out_bbox[4]  ) 

Parameters:
out_bbox The bounding box for this word (in unrotated page coordinates).
Note:
To account for the effect of the page's '/Rotate' attribute, transform all points using page.GetDefaultMatrix().
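
As a rough illustration of the note above, the sketch below applies an affine matrix to a word's bounding box and returns the axis-aligned box that encloses the transformed corners. `Affine` and `TransformBBox` are hypothetical names for this example and are not part of the SDK. The sketch also assumes that out_bbox is laid out as {x1, y1, x2, y2} and that the matrix maps points as x' = a*x + c*y + h, y' = b*x + d*y + v. In real code the coefficients would come from page.GetDefaultMatrix().

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for an affine matrix such as the one returned by
// page.GetDefaultMatrix(). Field names are assumptions made for this sketch.
// Mapping: x' = a*x + c*y + h,  y' = b*x + d*y + v
struct Affine {
    double a, b, c, d, h, v;
};

// Transform a bounding box, assumed to be laid out as {x1, y1, x2, y2},
// and write the axis-aligned box that encloses the four transformed corners.
void TransformBBox(const Affine& m, const double in_bbox[4], double out_bbox[4]) {
    assert(in_bbox != nullptr && out_bbox != nullptr);
    // The four corners of the input box.
    const double xs[4] = {in_bbox[0], in_bbox[2], in_bbox[2], in_bbox[0]};
    const double ys[4] = {in_bbox[1], in_bbox[1], in_bbox[3], in_bbox[3]};

    // Seed the running min/max with the first transformed corner.
    double min_x = m.a * xs[0] + m.c * ys[0] + m.h;
    double min_y = m.b * xs[0] + m.d * ys[0] + m.v;
    double max_x = min_x;
    double max_y = min_y;

    for (int i = 1; i < 4; ++i) {
        const double x = m.a * xs[i] + m.c * ys[i] + m.h;
        const double y = m.b * xs[i] + m.d * ys[i] + m.v;
        min_x = std::min(min_x, x);
        max_x = std::max(max_x, x);
        min_y = std::min(min_y, y);
        max_y = std::max(max_y, y);
    }

    out_bbox[0] = min_x;
    out_bbox[1] = min_y;
    out_bbox[2] = max_x;
    out_bbox[3] = max_y;
}
```

All four corners are transformed, rather than only two, because a rotation can swap which corner is the minimum and which is the maximum.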

void pdftron::PDF::TextExtractor::Word::GetQuad ( double  out_quad[8]  ) 

Parameters:
out_quad The quadrilateral representing a tight bounding box for this word (in unrotated page coordinates).

void pdftron::PDF::TextExtractor::Word::GetGlyphQuad ( int  glyph_idx,
double  out_quad[8] 
)

Parameters:
glyph_idx The index of a glyph in this word.
out_quad The quadrilateral representing a tight bounding box for a given glyph in the word (in unrotated page coordinates).
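
A quadrilateral follows rotated or skewed text tightly, while a bounding box is axis-aligned. The sketch below derives an axis-aligned rectangle from a quad, which can be handy when merging glyph quads into a highlight region. It assumes the eight values are four (x, y) pairs stored as x0, y0, x1, y1, and so on. The reference above does not state the point order, and the code does not depend on it. `Bounds` and `QuadBounds` are hypothetical names for this example.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type for this sketch: an axis-aligned rectangle.
struct Bounds {
    double x1, y1, x2, y2;
};

// Compute the axis-aligned rectangle enclosing a quad. The quad is assumed
// to hold four (x, y) pairs: x values at even indices, y values at odd ones.
Bounds QuadBounds(const double quad[8]) {
    assert(quad != nullptr);
    Bounds b = {quad[0], quad[1], quad[0], quad[1]};
    for (int i = 1; i < 4; ++i) {
        b.x1 = std::min(b.x1, quad[2 * i]);
        b.y1 = std::min(b.y1, quad[2 * i + 1]);
        b.x2 = std::max(b.x2, quad[2 * i]);
        b.y2 = std::max(b.y2, quad[2 * i + 1]);
    }
    return b;
}
```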

Style pdftron::PDF::TextExtractor::Word::GetCharStyle ( int  char_idx  ) 

Parameters:
char_idx The index of a character in this word.
Returns:
The style associated with a given character.

Style pdftron::PDF::TextExtractor::Word::GetStyle (  ) 

Returns:
The predominant style for this word.

int pdftron::PDF::TextExtractor::Word::GetStringLen (  ) 

Returns:
The number of characters in this word.

const Unicode* pdftron::PDF::TextExtractor::Word::GetString (  ) 

Returns:
The content of this word represented as a Unicode string.
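
Because GetString() returns a raw pointer, it is safest to pair it with GetStringLen() rather than rely on a terminator. The sketch below converts such a buffer to a UTF-8 std::string. It assumes that `Unicode` is a 16-bit unsigned type holding UTF-16 code units; the typedef here is a stand-in for the SDK's own definition, and `WordToUtf8` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in for the SDK's Unicode type. Assumption: 16-bit UTF-16 code units.
typedef std::uint16_t Unicode;

// Convert `len` UTF-16 code units to a UTF-8 encoded std::string.
// The length is passed explicitly, so no terminator is required.
std::string WordToUtf8(const Unicode* text, int len) {
    assert(text != nullptr || len == 0);
    std::string out;
    for (int i = 0; i < len; ++i) {
        std::uint32_t cp = text[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len &&
            text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into a single code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            // Unpaired surrogate: substitute the replacement character.
            cp = 0xFFFD;
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With the SDK, a call would look like WordToUtf8(word.GetString(), word.GetStringLen()), provided the assumption about the Unicode type holds.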

Word pdftron::PDF::TextExtractor::Word::GetNextWord (  ) 

Returns:
The next word on the current line.

int pdftron::PDF::TextExtractor::Word::GetCurrentNum (  ) 

Returns:
The index of this word in the current line. The word that starts the line returns 0, whereas the last word in the line returns (line.GetNumWords()-1).

bool pdftron::PDF::TextExtractor::Word::IsValid (  ) 

Returns:
True if this is a valid word, false otherwise.
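
GetNextWord() and IsValid() together suggest a simple traversal loop. The sketch below assumes that calling GetNextWord() on the last word of a line yields a word for which IsValid() returns false; the reference above does not state this explicitly. `CountCharsFrom` is a hypothetical helper written as a template so that it only depends on the member names documented here. `FakeWord` is a stand-in that lets the loop run without the SDK; it is not how the real class is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum GetStringLen() over a word and every word after it on the same line.
// WordT is any type exposing IsValid(), GetNextWord(), and GetStringLen().
template <typename WordT>
int CountCharsFrom(WordT word) {
    int total = 0;
    for (; word.IsValid(); word = word.GetNextWord()) {
        total += word.GetStringLen();
    }
    return total;
}

// Hypothetical stand-in used only to exercise the loop without the SDK.
// A "line" is modeled as a vector of word lengths.
class FakeWord {
public:
    FakeWord() : lens_(nullptr), idx_(0) {}
    FakeWord(const std::vector<int>* lens, std::size_t idx)
        : lens_(lens), idx_(idx) {}

    // Valid only while the index points at an existing word.
    bool IsValid() const { return lens_ != nullptr && idx_ < lens_->size(); }

    // Stepping past the last word produces an invalid word.
    FakeWord GetNextWord() const { return FakeWord(lens_, idx_ + 1); }

    int GetStringLen() const {
        assert(IsValid());
        return (*lens_)[idx_];
    }

    // Zero-based position of this word within its line.
    int GetCurrentNum() const { return static_cast<int>(idx_); }

private:
    const std::vector<int>* lens_;
    std::size_t idx_;
};
```

Checking IsValid() before each access keeps the loop safe for empty lines as well, since a default-constructed stand-in word is invalid.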

bool pdftron::PDF::TextExtractor::Word::operator== ( const Word &  ) 

bool pdftron::PDF::TextExtractor::Word::operator!= ( const Word &  ) 


© 2002-2010 PDFTron Systems Inc.