granicus_archiver.legistar.search_indexing

granicus_archiver.legistar.search_indexing.SchemaTerm

Valid terms in the Whoosh schema

alias of Literal['file_id', 'title', 'category', 'content', 'datetime']
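A Literal alias like this can also be checked at runtime. The following is a minimal sketch, not the module's own code; `VALID_TERMS` and `is_schema_term` are hypothetical names introduced for illustration.

```python
from typing import Literal, get_args

# Mirrors the alias documented above.
SchemaTerm = Literal['file_id', 'title', 'category', 'content', 'datetime']

# get_args() recovers the allowed strings from the Literal at runtime.
VALID_TERMS = frozenset(get_args(SchemaTerm))


def is_schema_term(name: str) -> bool:
    """Hypothetical helper: True if ``name`` is one of the schema terms."""
    return name in VALID_TERMS
```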

class granicus_archiver.legistar.search_indexing.SearchResultRaw

Bases: TypedDict

class granicus_archiver.legistar.search_indexing.SearchResult(file_id: FileId, category: Category, page_num: int, matched_terms: list[SchemaTerm], score: float, fields: dict | None = None, highlights: list[str] | str | None = None)

Bases: NamedTuple

A single search result from the Whoosh index

Parameters:
file_id: FileId

Unique identifier for the file

category: Category

Category of the item

page_num: int

Page number in the document where the match was found

matched_terms: list[Literal['file_id', 'title', 'category', 'content', 'datetime']]

List of schema terms that matched the query

score: float

Relevance score of the search result

fields: dict | None

Optional dictionary of additional fields from the index result

highlights: list[str] | str | None

Optional highlighted text snippets from the search result
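The field layout above can be illustrated with a stand-alone sketch. It is not the real class: `SearchResultSketch` uses plain strings in place of FileId and Category, and the helpers `best_first` and `highlight_list` are hypothetical.

```python
from typing import NamedTuple, Optional, Union


class SearchResultSketch(NamedTuple):
    file_id: str                 # stand-in for FileId
    category: str                # stand-in for Category
    page_num: int
    matched_terms: list
    score: float
    fields: Optional[dict] = None
    highlights: Union[list, str, None] = None


def best_first(results):
    """Order results by descending relevance score."""
    return sorted(results, key=lambda r: r.score, reverse=True)


def highlight_list(result):
    """Normalize the three allowed shapes of ``highlights`` to a list."""
    h = result.highlights
    if h is None:
        return []
    if isinstance(h, str):
        return [h]
    return list(h)
```

Because `highlights` may be a list, a single string, or None, callers that render snippets will probably want a normalizing step like `highlight_list` before iterating.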

class granicus_archiver.legistar.search_indexing.FileId(rguid: REAL_GUID, file_uid: LegistarFileUID)

Bases: NamedTuple

Unique identifier for a Legistar file

Parameters:
rguid: REAL_GUID
file_uid: LegistarFileUID

property as_str: str

String representation of the FileId

classmethod from_str(s: str) → FileId

Create a FileId from its string representation

Parameters:

s (str)

Return type:

FileId
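The as_str / from_str pair describes a string round trip. The sketch below shows one way such a pair can work; the `:` separator and the plain `str` field types are assumptions, since the actual encoding is not documented here, and `FileIdSketch` is a hypothetical name.

```python
from typing import NamedTuple


class FileIdSketch(NamedTuple):
    rguid: str      # stand-in for REAL_GUID
    file_uid: str   # stand-in for LegistarFileUID

    @property
    def as_str(self) -> str:
        # Assumed encoding: the two parts joined by a colon.
        return f"{self.rguid}:{self.file_uid}"

    @classmethod
    def from_str(cls, s: str) -> "FileIdSketch":
        # Split on the first colon only; reject strings without one.
        rguid, sep, file_uid = s.partition(":")
        if not sep:
            raise ValueError(f"not a valid file id string: {s!r}")
        return cls(rguid, file_uid)
```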

granicus_archiver.legistar.search_indexing.build_schema() → Schema

Build the Whoosh schema for indexing Legistar files

Return type:

Schema

granicus_archiver.legistar.search_indexing.build_index(index_dir: str | Path) → FileIndex

Build a Whoosh index in the given directory

If the directory does not exist, it will be created.

Parameters:

index_dir (str | Path) – Directory to store the index

Return type:

FileIndex

granicus_archiver.legistar.search_indexing.get_searcher(index: FileIndex | str | Path) → Generator[Searcher, None, None]

Context manager that yields a searcher for the given index

Parameters:

index (FileIndex | str | Path)

Return type:

Generator[Searcher, None, None]
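A function described as a context manager with a Generator return type is usually a generator decorated with contextlib.contextmanager; that is an inference from the signature, not something stated here. A minimal sketch of the pattern, with a hypothetical `FakeSearcher` standing in for the real searcher:

```python
from contextlib import contextmanager
from typing import Generator


class FakeSearcher:
    """Hypothetical stand-in for a searcher; tracks whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


@contextmanager
def get_fake_searcher() -> Generator[FakeSearcher, None, None]:
    searcher = FakeSearcher()
    try:
        # Control passes to the body of the ``with`` block here.
        yield searcher
    finally:
        # Runs on normal exit and on exceptions alike.
        searcher.close()
```

Used as `with get_fake_searcher() as searcher: ...`, the searcher is closed as soon as the block exits.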

granicus_archiver.legistar.search_indexing.search_contents(query_str: str, index: FileIndex | str | Path, limit: int = 10) → list[SearchResult]

Search the Whoosh index for the given query string

Parameters:
  • query_str (str) – Query string to search for

  • index (FileIndex | str | Path) – Whoosh index or path to the index directory

  • limit (int) – Maximum number of results to return

Returns:

A list of SearchResult objects

Return type:

list[SearchResult]

granicus_archiver.legistar.search_indexing.add_document(file_id: FileId, category: Category, title: str, content: str, dt: datetime, page_num: int, writer: IndexWriter) → None

Add a document to the index

Note

The document is not committed until whoosh.writing.IndexWriter.commit() is called on the writer.

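Parameters:
  • file_id (FileId)

  • category (Category)

  • title (str)

  • content (str)

  • dt (datetime)

  • page_num (int)

  • writer (IndexWriter)

Return type: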

None
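The note about committing can be pictured with a toy writer that stages documents until commit() is called. `BufferingWriter` is a hypothetical in-memory stand-in written for illustration; it is not the Whoosh writer and does not reproduce its behavior beyond the staging idea.

```python
class BufferingWriter:
    """Hypothetical stand-in for an index writer; not the Whoosh class."""

    def __init__(self) -> None:
        self.pending = []     # staged documents, not yet visible
        self.committed = []   # documents visible to readers

    def add_document(self, **fields) -> None:
        # Staged only; nothing is searchable yet.
        self.pending.append(fields)

    def commit(self) -> None:
        # Make every staged document visible, then clear the buffer.
        self.committed.extend(self.pending)
        self.pending.clear()
```

A caller that adds many documents and commits once at the end pays for a single commit, at the cost of nothing being visible until then.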

granicus_archiver.legistar.search_indexing.document_exists(file_id: FileId, index: FileIndex, searcher: Searcher) → bool

Check if a document with the given file_id exists in the index

Parameters:
  • file_id (FileId)

  • index (FileIndex)

  • searcher (Searcher)

Return type:

bool

granicus_archiver.legistar.search_indexing.iter_files_for_item(item: RGuidDetailResult) → Iterator[LegistarFile]

Iterate over the Legistar files for a given Legistar item

Parameters:

item (RGuidDetailResult)

Return type:

Iterator[LegistarFile]

granicus_archiver.legistar.search_indexing.index_legistar_item(writer: IndexWriter, legistar_item: RGuidDetailResult) → tuple[int, set[FileId]]

Index a Legistar item into the index

Returns:

A tuple of (number of documents indexed, set of FileId objects indexed)

Parameters:
  • writer (IndexWriter)

  • legistar_item (RGuidDetailResult)

Return type:

tuple[int, set[FileId]]
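When several items are indexed in one run, the per-item (count, ids) tuples can be folded into a running total. This is a sketch of that bookkeeping only; `merge_index_results` is a hypothetical helper and plain strings stand in for FileId.

```python
def merge_index_results(results):
    """Combine (count, ids) tuples into one (total_count, all_ids) tuple."""
    total = 0
    all_ids = set()
    for count, ids in results:
        total += count      # documents indexed for this item
        all_ids |= ids      # set union drops repeated file ids
    return total, all_ids
```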

granicus_archiver.legistar.search_indexing.extract_pdf_text(infile: Path | str) → list[str]

Extract text from a PDF file using layout mode

See pypdf.PageObject.extract_text() for details.

Parameters:

infile (Path | str) – The input PDF file

Returns:

A list of strings, one per page in the PDF

Return type:

list[str]

granicus_archiver.legistar.search_indexing.index_legistar_items(config: Config, max_docs: int | None) → None

Index Legistar items into the index

Parameters:
  • config (Config) – Configuration object

  • max_docs (int | None) – Maximum number of documents to index in this run. If None, index all documents.

Return type:

None
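The "None means no limit" convention for max_docs maps directly onto itertools.islice, which treats a stop value of None as unbounded. The sketch below shows that convention in isolation; `take_up_to` is a hypothetical helper, and how the real function applies its limit is not documented here.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def take_up_to(items: Iterable[T], max_docs: Optional[int]) -> Iterator[T]:
    """Yield at most ``max_docs`` items, or every item if it is None."""
    # islice(iterable, None) imposes no upper bound.
    return islice(items, max_docs)
```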