granicus_archiver.legistar.search_indexing
- granicus_archiver.legistar.search_indexing.SchemaTerm
Valid terms in the Whoosh schema
alias of Literal['file_id', 'title', 'category', 'content', 'datetime']
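Because the alias is a `Literal`, its members can be inspected at runtime. The following is a minimal sketch of validating a field name against it; `is_schema_term` is a hypothetical helper for illustration and is not part of the module.

```python
from typing import Literal, get_args

# Mirrors the alias documented above.
SchemaTerm = Literal["file_id", "title", "category", "content", "datetime"]


def is_schema_term(name: str) -> bool:
    """Return True if ``name`` is one of the documented schema terms.

    Hypothetical helper, not part of granicus_archiver.
    """
    return name in get_args(SchemaTerm)
```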
- class granicus_archiver.legistar.search_indexing.SearchResult(file_id: FileId, category: Category, page_num: int, matched_terms: list[SchemaTerm], score: float, fields: dict | None = None, highlights: list[str] | str | None = None)
Bases: NamedTuple
A single search result from the Whoosh index
- Parameters:
file_id (FileId)
category (Category)
page_num (int)
matched_terms (list[SchemaTerm])
score (float)
fields (dict | None)
highlights (list[str] | str | None)
- class granicus_archiver.legistar.search_indexing.FileId(rguid: REAL_GUID, file_uid: LegistarFileUID)
Bases: NamedTuple
Unique identifier for a Legistar file
- Parameters:
rguid (REAL_GUID)
file_uid (LegistarFileUID)
- file_uid: LegistarFileUID
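Since `FileId` is a `NamedTuple`, instances unpack positionally and are hashable, which is what allows them to be collected in a `set`. The sketch below re-declares the class to show that behavior. The `NewType` definitions are stand-ins assumed for illustration; the real `REAL_GUID` and `LegistarFileUID` types are defined elsewhere in the package and may differ.

```python
from __future__ import annotations

from typing import NamedTuple, NewType

# Assumption: string-based stand-ins for the package's real types.
REAL_GUID = NewType("REAL_GUID", str)
LegistarFileUID = NewType("LegistarFileUID", str)


class FileId(NamedTuple):
    """Sketch of the documented class, with the same two fields."""

    rguid: REAL_GUID
    file_uid: LegistarFileUID


file_id = FileId(REAL_GUID("example-guid"), LegistarFileUID("example-file"))

# Fields are available by name or by position.
rguid, file_uid = file_id

# NamedTuples are hashable, so they can be stored in a set.
seen = {file_id}
```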
- granicus_archiver.legistar.search_indexing.build_schema() → Schema
Build the whoosh schema for indexing Legistar files
- Return type:
Schema
- granicus_archiver.legistar.search_indexing.build_index(index_dir: str | Path) → FileIndex
Build a whoosh index at the given directory
If the directory does not exist, it will be created.
- granicus_archiver.legistar.search_indexing.get_searcher(index: FileIndex | str | Path) → Generator[Searcher, None, None]
Context manager to get a searcher
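The `Generator[Searcher, None, None]` return annotation is the usual shape of a function wrapped with `contextlib.contextmanager`. The sketch below shows that general pattern only. `FakeSearcher` and `get_searcher_sketch` are hypothetical stand-ins, and the real function's handling of `FileIndex` versus path arguments is not shown here.

```python
from __future__ import annotations

from contextlib import contextmanager
from pathlib import Path
from typing import Generator


class FakeSearcher:
    """Hypothetical stand-in for a Whoosh ``Searcher``."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.closed = False

    def close(self) -> None:
        self.closed = True


@contextmanager
def get_searcher_sketch(index: str | Path) -> Generator[FakeSearcher, None, None]:
    # Acquire the resource, hand it to the caller, and always release it,
    # even if the body of the ``with`` block raises.
    searcher = FakeSearcher(str(index))
    try:
        yield searcher
    finally:
        searcher.close()


with get_searcher_sketch("index_dir") as searcher:
    print(searcher.source, searcher.closed)
```

With this pattern the caller never closes the searcher explicitly; leaving the `with` block does it.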
- granicus_archiver.legistar.search_indexing.search_contents(query_str: str, index: FileIndex | str | Path, limit: int = 10) → list[SearchResult]
Search the Whoosh index for the given query string
- Parameters:
query_str (str)
index (FileIndex | str | Path)
limit (int)
- Returns:
A list of SearchResult objects
- Return type:
list[SearchResult]
- granicus_archiver.legistar.search_indexing.add_document(file_id: FileId, category: Category, title: str, content: str, dt: datetime, page_num: int, writer: IndexWriter) → None
Add a document to the index
Note
The document is not committed until whoosh.IndexWriter.commit() is called.
- granicus_archiver.legistar.search_indexing.document_exists(file_id: FileId, index: FileIndex, searcher: Searcher) → bool
Check if a document with the given file_id exists in the index
- granicus_archiver.legistar.search_indexing.iter_files_for_item(item: RGuidDetailResult) → Iterator[LegistarFile]
Iterate over the Legistar files for a given Legistar item
- Parameters:
item (RGuidDetailResult)
- Return type:
Iterator[LegistarFile]
- granicus_archiver.legistar.search_indexing.index_legistar_item(writer: IndexWriter, legistar_item: RGuidDetailResult) → tuple[int, set[FileId]]
Index a Legistar item into the index
- Returns:
A tuple of (number of documents indexed, set of FileId objects indexed)
- Parameters:
writer (IndexWriter)
legistar_item (RGuidDetailResult)
- Return type:
tuple[int, set[FileId]]
- granicus_archiver.legistar.search_indexing.extract_pdf_text(infile: Path | str) → list[str]
Extract text from a PDF file using layout mode
See pypdf.PageObject.extract_text() for details.
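For orientation, layout-mode extraction with pypdf can be sketched as below. This is an illustration rather than the module's actual implementation: `extract_pdf_text_sketch` is a hypothetical name, the one-string-per-page result is an assumption based on the `list[str]` return type, and the `extraction_mode="layout"` argument requires a pypdf release that supports layout mode.

```python
from __future__ import annotations

from pathlib import Path


def extract_pdf_text_sketch(infile: Path | str) -> list[str]:
    """Return one layout-mode text string per page of ``infile``.

    Hypothetical sketch; the real function may differ in detail.
    """
    # Imported lazily so the sketch can be defined without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(infile))
    # Layout mode tries to preserve the horizontal positioning of text.
    return [page.extract_text(extraction_mode="layout") for page in reader.pages]
```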