pdf-indexer

Version: org.nasdanika.launcher.demo@2025.5.0
Usage: nsd pdf-indexer [-hV] [--granularity=<granularity>]
                       [[--progress-logger=<progressLogger>] [--progress-output=<progressOutput>]
                        [--progress-json] [--progress-console] [--progress-data]]
                       [[--chunk-size=<chunkSize>] [--chunks-overlap=<chunksOverlap>]
                        [--chunk-encoding-type=<encodingType>]]
                       [[--embeddings-provider=<embeddingsProvider>]
                        [--embeddings-model=<embeddingsModel>]
                        [--embeddings-version=<embeddingsVersion>]]
                       [[--hnsw-ef=<ef>] [--hnsw-ef-contruction=<efConstruction>] [--hnsw-m=<m>]
                        [--hnsw-remove-enabled] [--hnsw-threads=<threads>]
                        [--hnsw-progress-update-interval=<progressUpdateInterval>]
                        [--hnsw-distance-function=<distanceFunction>] [--hnsw-normalize]]
                       <output> <textMap> <String=File>...

Creates a vector index and a URI -> text mapping from a PDF file or a directory of PDF files

      <output>                Index output file
      <textMap>               URI to plain text map JSON output
      <String=File>...        Input <base uri>=<file or directory>
      --granularity=<granularity>
                              Text granularity
                              Valid values: document, page, article, paragraph
                              Default value: page
  -h, --help                  Show this help message and exit.
  -V, --version               Print version information and exit.

Progress monitor
      --progress-console      Output progress to console
      --progress-data         Output progress data
      --progress-json         Output progress in JSON
      --progress-logger=<progressLogger>
                              Output logger for progress monitor
      --progress-output=<progressOutput>
                              Output file for progress monitor

Chunking
      --chunk-encoding-type=<encodingType>
                              Chunk encoding type
                              Valid values: R50K_BASE, P50K_BASE, P50K_EDIT, CL100K_BASE, O200K_BASE
                              Default value: CL100K_BASE
      --chunk-size=<chunkSize>
                              Chunk size in tokens
      --chunks-overlap=<chunksOverlap>
                              Chunks overlap in tokens

Embeddings
      --embeddings-model=<embeddingsModel>
                              Embeddings model
      --embeddings-provider=<embeddingsProvider>
                              Embeddings provider
      --embeddings-version=<embeddingsVersion>
                              Embeddings version

Vector index
      --hnsw-distance-function=<distanceFunction>
                              Vector distance function
                              Valid values: BRAY_CURTIS, CANBERRA, CORRELATION, COSINE, EUCLIDEAN,
                                INNER_PRODUCT, MANHATTAN, VECTOR_FLOAT_128_BRAY_CURTIS,
                                VECTOR_FLOAT_128_CANBERRA, VECTOR_FLOAT_128_COSINE,
                                VECTOR_FLOAT_128_EUCLIDEAN, VECTOR_FLOAT_128_INNER_PRODUCT,
                                VECTOR_FLOAT_128_MANHATTAN, VECTOR_FLOAT_256_BRAY_CURTIS,
                                VECTOR_FLOAT_256_CANBERRA, VECTOR_FLOAT_256_COSINE,
                                VECTOR_FLOAT_256_EUCLIDEAN, VECTOR_FLOAT_256_INNER_PRODUCT,
                                VECTOR_FLOAT_256_MANHATTAN
                              Default value: COSINE
      --hnsw-ef=<ef>          Size of the dynamic list for the nearest neighbors
                              Default value: 200
      --hnsw-ef-contruction=<efConstruction>
                              Controls the index time / index precision
                              Default value: 200
      --hnsw-m=<m>            The number of bi-directional links created for every new element
                                during construction
                              Default value: 16
      --hnsw-normalize        If true, vectors are normalized
      --hnsw-progress-update-interval=<progressUpdateInterval>
                              After indexing this many items progress will be reported. The last
                                element will always be reported regardless of this setting.
                              Default value: 100000
      --hnsw-remove-enabled   If true, removal from the index is enabled
      --hnsw-threads=<threads>
                              Number of threads to use for parallel indexing
                              Defaults to the number of available processors
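For example, the following invocation (file names, the base URI, and chunk sizes are illustrative, not defaults) indexes the PDF files under docs/ page by page, splits the text into 500-token chunks overlapping by 50 tokens, and writes the vector index and the URI -> text map:

    nsd pdf-indexer \
        --granularity=page \
        --chunk-size=500 \
        --chunks-overlap=50 \
        --progress-console \
        index.bin text-map.json https://example.com/docs/=docs

The last argument associates a base URI with the input directory, as described by the <String=File> parameter above.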

Options

--hnsw-ef

The size of the dynamic list for the nearest neighbors, used during search. A higher ef leads to a more accurate but slower search. The value of ef can be anything between k (the number of items to return from a search) and the size of the dataset.
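For example (illustrative numbers): if searches against the index will request the k = 10 nearest chunks and the index holds about 50,000 chunks, any value in

    10 <= ef <= 50000

is valid. The default of 200 sits near the fast end of that range; passing --hnsw-ef=400 would trade some search speed for accuracy.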

--hnsw-ef-contruction

The option has the same meaning as --hnsw-ef, but controls the index time / index precision trade-off. A bigger ef-construction leads to longer construction but better index quality. At some point, increasing ef-construction no longer improves the quality of the index. One way to check whether the selected ef-construction is adequate is to measure the recall of an M nearest neighbor search with ef = ef-construction: if the recall is lower than 0.9, there is room for improvement.
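Recall here is the fraction of the true nearest neighbors that the approximate search returns. For a query whose exact M nearest neighbors are known from a brute-force scan,

    recall = |HNSW results ∩ exact M nearest neighbors| / M

averaged over a sample of queries; with ef = ef-construction, an average below 0.9 indicates that a larger ef-construction (or a larger m) is worth trying.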

--hnsw-m

Sets the number of bi-directional links created for every new element during construction. A reasonable range for m is 2-100. A higher m works better on datasets with high intrinsic dimensionality and/or high recall requirements, while a lower m works better on datasets with low intrinsic dimensionality and/or low recall requirements. The parameter also determines the algorithm's memory consumption. For example, for d = 4 random vectors the optimal m for search is somewhere around 6, while high-dimensional datasets (word embeddings, good face descriptors) require a higher m (e.g. m = 48 or 64) for optimal performance at high recall. The range m = 12-48 is fine for most use cases. When m is changed, the other parameters have to be updated as well; nonetheless, ef and efConstruction can be roughly estimated by assuming that m * efConstruction is a constant.
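As a worked example of that heuristic (a rough estimate, with placeholder file names): the defaults m = 16 and efConstruction = 200 give m * efConstruction = 3200, so moving to m = 48 for a high-dimensional embeddings dataset while keeping construction cost roughly constant suggests efConstruction ≈ 3200 / 48 ≈ 67:

    nsd pdf-indexer \
        --hnsw-m=48 \
        --hnsw-ef-contruction=67 \
        index.bin text-map.json https://example.com/docs/=docs

If index quality matters more than build time, it is safer to keep efConstruction at or above its default and verify the recall check described under --hnsw-ef-contruction.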