`granicus_archiver.legistar.search_indexing`

granicus_archiver.legistar.search_indexing.SchemaTerm

Valid terms in the Whoosh schema

alias of Literal['file_id', 'title', 'category', 'content', 'datetime']

class granicus_archiver.legistar.search_indexing.SearchResultRaw

Bases: TypedDict

class granicus_archiver.legistar.search_indexing.SearchResult(file_id: FileId, category: Category, page_num: int, matched_terms: list[SchemaTerm], score: float, fields: dict | None = None, highlights: list[str] | str | None = None)

Bases: NamedTuple

A single search result from the Whoosh index

Parameters:

file_id (FileId)
category (Category)
page_num (int)
matched_terms (list[Literal['file_id', 'title', 'category', 'content', 'datetime']])
score (float)
fields (dict | None)
highlights (list[str] | str | None)

file_id: FileId – Unique identifier for the file

category: Category – Category of the item

page_num: int – Page number in the document where the match was found

matched_terms: list[Literal['file_id', 'title', 'category', 'content', 'datetime']] – List of schema terms that matched the query

score: float – Relevance score of the search result

fields: dict | None – Optional dictionary of additional fields from the index result

highlights: list[str] | str | None – Optional highlighted text snippets from the search result
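
As an illustration of how these fields might be consumed, the sketch below defines a stand-in tuple with the same field names as the documented signature and keeps only the results whose `content` term matched, best score first. `ResultSketch` and `content_hits` are hypothetical names, and plain strings are assumed in place of the package's `FileId` and `Category` types.

```python
from typing import NamedTuple, Optional, Union


class ResultSketch(NamedTuple):
    """Stand-in mirroring the documented SearchResult fields (not the real class)."""
    file_id: str  # the real field is a FileId; a plain string is assumed here
    category: str  # the real field is a Category
    page_num: int
    matched_terms: list[str]
    score: float
    fields: Optional[dict] = None
    highlights: Union[list[str], str, None] = None


def content_hits(results: list[ResultSketch]) -> list[ResultSketch]:
    """Keep results that matched the 'content' term, highest score first."""
    hits = [r for r in results if "content" in r.matched_terms]
    return sorted(hits, key=lambda r: r.score, reverse=True)


results = [
    ResultSketch("a", "agenda", 3, ["content"], 1.5),
    ResultSketch("b", "minutes", 1, ["title"], 4.0),
    ResultSketch("c", "agenda", 7, ["title", "content"], 2.25),
]
best = content_hits(results)
```

Because the result is a named tuple, each hit can be unpacked positionally or accessed by field name, for example `best[0].page_num`.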

class granicus_archiver.legistar.search_indexing.FileId(rguid: REAL_GUID, file_uid: LegistarFileUID)

Bases: NamedTuple

Unique identifier for a Legistar file

Parameters:

rguid (REAL_GUID)
file_uid (LegistarFileUID)

rguid: REAL_GUID

file_uid: LegistarFileUID

property as_str: str – String representation of the FileId

classmethod from_str(s: str) → FileId

Create a FileId from its string representation

Parameters: s (str)
Return type: FileId
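
The string format used by `as_str` and `from_str` is not documented here, so the following is only a sketch of the round-trip idea. The `:` separator, the `FileIdSketch` class, and the `RGuid` and `FileUid` stand-in types are all assumptions made for illustration.

```python
from typing import NamedTuple, NewType

# Stand-ins for the package's REAL_GUID and LegistarFileUID (assumed string-like).
RGuid = NewType("RGuid", str)
FileUid = NewType("FileUid", str)

SEP = ":"  # hypothetical separator; the real format may differ


class FileIdSketch(NamedTuple):
    rguid: RGuid
    file_uid: FileUid

    @property
    def as_str(self) -> str:
        """Join both parts into a single string."""
        return f"{self.rguid}{SEP}{self.file_uid}"

    @classmethod
    def from_str(cls, s: str) -> "FileIdSketch":
        """Split on the first separator and rebuild the tuple."""
        rguid, file_uid = s.split(SEP, 1)
        return cls(RGuid(rguid), FileUid(file_uid))


original = FileIdSketch(RGuid("guid-123"), FileUid("agenda"))
round_tripped = FileIdSketch.from_str(original.as_str)
```

A single string form like this is convenient when an identifier has to be stored in one index field and recovered from a search hit later.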

granicus_archiver.legistar.search_indexing.build_schema() → Schema

Build the Whoosh schema for indexing Legistar files

Return type: Schema

granicus_archiver.legistar.search_indexing.build_index(index_dir: str | Path) → FileIndex

Build a Whoosh index in the given directory

If the directory does not exist, it will be created.

Parameters: index_dir (str | Path) – Directory to store the index
Return type: FileIndex

granicus_archiver.legistar.search_indexing.get_searcher(index: FileIndex | str | Path) → Generator[Searcher, None, None]

Context manager that yields a searcher for the given index

Parameters: index (FileIndex | str | Path)
Return type: Generator[Searcher, None, None]
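
A `Generator[..., None, None]` return type on a function described as a context manager is the usual shape of a function wrapped with `contextlib.contextmanager`. The sketch below shows that pattern with a stand-in resource. `SearcherStub` and `get_searcher_sketch` are hypothetical, and closing the searcher on exit is an assumption about the real function, not something this page confirms.

```python
from contextlib import contextmanager
from typing import Generator


class SearcherStub:
    """Stand-in for a Whoosh Searcher; it only tracks whether it is open."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


@contextmanager
def get_searcher_sketch() -> Generator[SearcherStub, None, None]:
    searcher = SearcherStub()
    try:
        # Control passes to the body of the `with` block here.
        yield searcher
    finally:
        # Runs on normal exit and on exceptions alike.
        searcher.close()


with get_searcher_sketch() as s:
    was_open_inside = not s.closed
closed_after = s.closed
```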

granicus_archiver.legistar.search_indexing.search_contents(query_str: str, index: FileIndex | str | Path, limit: int = 10) → list[SearchResult]

Search the Whoosh index for the given query string

Parameters:

query_str (str) – Query string to search for
index (FileIndex | str | Path) – Whoosh index or path to the index directory
limit (int) – Maximum number of results to return

Returns:

A list of SearchResult objects

Return type:

list[SearchResult]

granicus_archiver.legistar.search_indexing.add_document(file_id: FileId, category: Category, title: str, content: str, dt: datetime, page_num: int, writer: IndexWriter) → None

Add a document to the index

Note

The document is not written to the index until whoosh.writing.IndexWriter.commit() is called on the writer.

Parameters:

file_id (FileId)
category (Category)
title (str)
content (str)
dt (datetime)
page_num (int)
writer (IndexWriter)

Return type:

None
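
The deferred-commit behaviour described in the note can be pictured with a small stand-in writer. `WriterStub` and `add_pages` are hypothetical, and the one-document-per-page layout with zero-based page numbers is an assumption suggested by the `page_num` parameter rather than a documented fact.

```python
class WriterStub:
    """Stand-in for an index writer that buffers documents until commit()."""

    def __init__(self) -> None:
        self.pending: list[dict] = []
        self.committed: list[dict] = []

    def add_document(self, **fields) -> None:
        self.pending.append(fields)

    def commit(self) -> None:
        # Only now do the buffered documents become visible.
        self.committed.extend(self.pending)
        self.pending.clear()

    def cancel(self) -> None:
        # Discard everything queued since the last commit.
        self.pending.clear()


def add_pages(writer: WriterStub, title: str, pages: list[str]) -> int:
    """Queue one document per page; the caller decides when to commit."""
    for page_num, content in enumerate(pages):
        writer.add_document(title=title, content=content, page_num=page_num)
    return len(pages)


writer = WriterStub()
queued = add_pages(writer, "Agenda", ["page one", "page two"])
visible_before = len(writer.committed)
writer.commit()
visible_after = len(writer.committed)
```

Leaving the commit to the caller lets many documents be queued against one writer and written in a single batch.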

granicus_archiver.legistar.search_indexing.document_exists(file_id: FileId, index: FileIndex, searcher: Searcher) → bool

Check if a document with the given file_id exists in the index

Parameters:

file_id (FileId)
index (FileIndex)
searcher (Searcher)

Return type:

bool

granicus_archiver.legistar.search_indexing.iter_files_for_item(item: RGuidDetailResult) → Iterator[LegistarFile]

Iterate over the Legistar files for a given Legistar item

Parameters: item (RGuidDetailResult)
Return type: Iterator[LegistarFile]

granicus_archiver.legistar.search_indexing.index_legistar_item(writer: IndexWriter, legistar_item: RGuidDetailResult) → tuple[int, set[FileId]]

Add a Legistar item to the index

Parameters:

writer (IndexWriter)
legistar_item (RGuidDetailResult)

Returns:

A tuple of (number of documents indexed, set of FileId objects indexed)

Return type:

tuple[int, set[FileId]]

granicus_archiver.legistar.search_indexing.extract_pdf_text(infile: Path | str) → list[str]

Extract text from a PDF file using layout mode

See pypdf.PageObject.extract_text() for details.

Parameters: infile (Path | str) – The input PDF file
Returns: A list of strings, one per page in the PDF
Return type: list[str]
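
For orientation, a minimal sketch of per-page layout-mode extraction with pypdf follows. `extract_pages_sketch` is a hypothetical name and is not claimed to match the package's implementation. It assumes a pypdf release that supports `extraction_mode="layout"`.

```python
from pathlib import Path
from typing import Union


def extract_pages_sketch(infile: Union[Path, str]) -> list[str]:
    """Return the layout-mode text of each page in a PDF."""
    # Imported lazily so the sketch can be loaded without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(infile))
    # One string per page, in page order.
    return [page.extract_text(extraction_mode="layout") for page in reader.pages]
```

Layout mode tries to preserve the horizontal positioning of text, which can help with tabular agendas at the cost of extra whitespace in the output.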

granicus_archiver.legistar.search_indexing.index_legistar_items(config: Config, max_docs: int | None) → None

Add Legistar items to the index

Parameters:

config (Config) – Configuration object
max_docs (int | None) – Maximum number of documents to index in this run. If None, index all documents.

Return type:

None
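
How `max_docs` is enforced is not described here. One plausible approach, sketched below under that caveat, is to sum the per-item document counts and stop once the cap is reached. `index_until_limit` and `fake_index_one` are hypothetical helpers, and strings stand in for Legistar items and `FileId` values.

```python
from typing import Callable, Iterable, Optional


def index_until_limit(
    items: Iterable[str],
    index_one: Callable[[str], tuple[int, set[str]]],
    max_docs: Optional[int],
) -> tuple[int, set[str]]:
    """Index items one by one, stopping once max_docs documents were added."""
    total = 0
    seen: set[str] = set()
    for item in items:
        # The cap is checked between items, so the last item may overshoot it.
        if max_docs is not None and total >= max_docs:
            break
        count, ids = index_one(item)
        total += count
        seen |= ids
    return total, seen


def fake_index_one(item: str) -> tuple[int, set[str]]:
    # Hypothetical per-item indexer: pretends every item yields two documents.
    return 2, {f"{item}-1", f"{item}-2"}


total, ids = index_until_limit(["a", "b", "c"], fake_index_one, max_docs=3)
```

In this sketch the limit is a soft cap: an item is always indexed in full, so the final count can exceed `max_docs` by up to one item's worth of documents.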
