Interface: WebCrawlerDataSourceProps
March 19, 2026
@cdklabs/generative-ai-cdk-constructs
Interface to create a new standalone data source object.
Extends
WebCrawlerDataSourceAssociationProps
Properties
chunkingStrategy?
readonly optional chunkingStrategy?: ChunkingStrategy
The chunking strategy to use for splitting your documents or content. The chunks are then converted to embeddings and written to the vector index, allowing for similarity search and retrieval of the content.
Default
ChunkingStrategy.DEFAULT
Inherited from
WebCrawlerDataSourceAssociationProps.chunkingStrategy
contextEnrichment?
readonly optional contextEnrichment?: ContextEnrichment
The context enrichment configuration to use.
Default
- No context enrichment is used.
Inherited from
WebCrawlerDataSourceAssociationProps.contextEnrichment
crawlingRate?
readonly optional crawlingRate?: number
The maximum rate at which pages are crawled, up to 300 per minute per host. Higher values decrease sync time but increase the load on the host.
Default
300
Inherited from
WebCrawlerDataSourceAssociationProps.crawlingRate
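The documented bounds can be checked before synthesis. The following is a minimal sketch, not part of the library: resolveCrawlingRate is a hypothetical helper, and the lower bound of 1 and the whole-number requirement are assumptions. Only the ceiling of 300 and the default of 300 come from the description above.

```typescript
// Documented ceiling: 300 pages per minute per host (also the documented default).
const MAX_CRAWLING_RATE = 300;
// Assumption: the smallest accepted rate is 1 page per minute.
const MIN_CRAWLING_RATE = 1;

/**
 * Hypothetical helper: returns the effective crawling rate,
 * or throws a RangeError when the value is out of range.
 */
function resolveCrawlingRate(crawlingRate?: number): number {
  if (crawlingRate === undefined) {
    // Property omitted: fall back to the documented default.
    return MAX_CRAWLING_RATE;
  }
  // Assumption: the rate must be a whole number of pages.
  if (!Number.isInteger(crawlingRate)) {
    throw new RangeError(`crawlingRate must be a whole number, got ${crawlingRate}`);
  }
  if (crawlingRate < MIN_CRAWLING_RATE || crawlingRate > MAX_CRAWLING_RATE) {
    throw new RangeError(
      `crawlingRate must be between ${MIN_CRAWLING_RATE} and ${MAX_CRAWLING_RATE}, got ${crawlingRate}`,
    );
  }
  return crawlingRate;
}
```

A lower value such as resolveCrawlingRate(60) trades a longer sync for less load on the crawled host.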
crawlingScope?
readonly optional crawlingScope?: CrawlingScope
The scope of the crawling.
Default
- CrawlingScope.DEFAULT
Inherited from
WebCrawlerDataSourceAssociationProps.crawlingScope
customTransformation?
readonly optional customTransformation?: CustomTransformation
The custom transformation strategy to use.
Default
- No custom transformation is used.
Inherited from
WebCrawlerDataSourceAssociationProps.customTransformation
dataDeletionPolicy?
readonly optional dataDeletionPolicy?: DataDeletionPolicy
The data deletion policy to apply to the data source.
Default
- Sets the data deletion policy to the default of the data source type.
Inherited from
WebCrawlerDataSourceAssociationProps.dataDeletionPolicy
dataSourceName?
readonly optional dataSourceName?: string
The name of the data source.
Default
- A new name will be generated.
Inherited from
WebCrawlerDataSourceAssociationProps.dataSourceName
description?
readonly optional description?: string
A description of the data source.
Default
- No description is provided.
Inherited from
WebCrawlerDataSourceAssociationProps.description
filters?
readonly optional filters?: CrawlingFilters
The filters (regular expression patterns) for the crawling. If there's a conflict, the exclude pattern takes precedence.
Default
None
Inherited from
WebCrawlerDataSourceAssociationProps.filters
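The precedence rule can be illustrated with a small local model. This is a sketch under stated assumptions, not the service's implementation: CrawlingFiltersSketch is a stand-in type whose field names (includePatterns, excludePatterns) are assumed, isUrlInScope is a hypothetical helper, and treating an empty include list as "include everything" is also an assumption.

```typescript
// Stand-in for the library's CrawlingFilters type; the field names are assumptions.
interface CrawlingFiltersSketch {
  readonly includePatterns?: string[];
  readonly excludePatterns?: string[];
}

// True when at least one regular expression pattern matches the URL.
function matchesAny(url: string, patterns: string[] = []): boolean {
  return patterns.some((pattern) => new RegExp(pattern).test(url));
}

/** Hypothetical helper: decides whether a URL would be crawled under the given filters. */
function isUrlInScope(url: string, filters: CrawlingFiltersSketch = {}): boolean {
  // Exclusion is checked first, so it wins when both lists match the same URL.
  if (matchesAny(url, filters.excludePatterns)) {
    return false;
  }
  const include = filters.includePatterns ?? [];
  // Assumption: with no include patterns, every non-excluded URL is in scope.
  return include.length === 0 || matchesAny(url, include);
}
```

With includePatterns set to [".*/docs/.*"] and excludePatterns set to [".*\\.pdf$"], a URL ending in /docs/guide.pdf matches both lists and is left out, because the exclude pattern takes precedence.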
kmsKey?
readonly optional kmsKey?: IKey
The KMS key to use to encrypt the data source.
Default
- Service owned and managed key.
Inherited from
WebCrawlerDataSourceAssociationProps.kmsKey
knowledgeBase
readonly knowledgeBase: IKnowledgeBase
The knowledge base to associate with the data source.
maxPages?
readonly optional maxPages?: number
The maximum number of web pages to crawl from your source URLs, up to 25,000 pages. If the number of web pages exceeds this limit, the data source sync fails and no web pages are ingested.
Default
- No limit
Inherited from
WebCrawlerDataSourceAssociationProps.maxPages
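The all-or-nothing behavior described above can be modeled locally. The following is an illustrative sketch only: simulateSync and SyncOutcome are hypothetical names, the lower bound of 1 is an assumption, and the sketch takes the documented "No limit" default at face value when maxPages is omitted.

```typescript
// Documented ceiling for maxPages.
const MAX_PAGES_LIMIT = 25000;

// Hypothetical result shape used only by this sketch.
interface SyncOutcome {
  readonly succeeded: boolean;
  readonly pagesIngested: number;
}

/** Hypothetical model of a sync: exceeding maxPages fails the whole sync. */
function simulateSync(discoveredPages: number, maxPages?: number): SyncOutcome {
  // Assumption: a configured limit must be at least 1.
  if (maxPages !== undefined && (maxPages < 1 || maxPages > MAX_PAGES_LIMIT)) {
    throw new RangeError(`maxPages must be between 1 and ${MAX_PAGES_LIMIT}, got ${maxPages}`);
  }
  // Property omitted: no limit is applied in this sketch.
  const limit = maxPages ?? Infinity;
  if (discoveredPages > limit) {
    // The sync fails as a whole; nothing is partially ingested.
    return { succeeded: false, pagesIngested: 0 };
  }
  return { succeeded: true, pagesIngested: discoveredPages };
}
```

Because a single page over the limit discards the entire sync, it is safer to set maxPages with some headroom above the expected size of the site.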
parsingStrategy?
readonly optional parsingStrategy?: ParsingStrategy
The parsing strategy to use.
Default
- No parsing strategy is used.
Inherited from
WebCrawlerDataSourceAssociationProps.parsingStrategy
sourceUrls
readonly sourceUrls: string[]
The source URLs, in the format https://www.sitename.com.
Maximum of 100 URLs.
Inherited from
WebCrawlerDataSourceAssociationProps.sourceUrls
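A list of source URLs can be checked against these constraints before it is passed to the construct. This is a minimal sketch: validateSourceUrls is a hypothetical helper, the requirement of at least one URL is an assumption, and the https:// check is inferred from the example format above (the service itself may accept other forms).

```typescript
// Documented ceiling: at most 100 source URLs.
const MAX_SOURCE_URLS = 100;

/** Hypothetical helper: returns the list unchanged or throws on the first violated rule. */
function validateSourceUrls(sourceUrls: string[]): string[] {
  // Assumption: an empty list is not useful and is rejected.
  if (sourceUrls.length === 0) {
    throw new Error("sourceUrls must contain at least one URL");
  }
  if (sourceUrls.length > MAX_SOURCE_URLS) {
    throw new Error(`sourceUrls accepts at most ${MAX_SOURCE_URLS} URLs, got ${sourceUrls.length}`);
  }
  // Assumption: only the documented https://host form is accepted (a loose check).
  const malformed = sourceUrls.filter((url) => !/^https:\/\/[^\s/]+/.test(url));
  if (malformed.length > 0) {
    throw new Error(`sourceUrls must look like https://www.sitename.com, got: ${malformed.join(", ")}`);
  }
  return sourceUrls;
}
```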
userAgent?
readonly optional userAgent?: string
The user agent string to use when crawling.
Default
- Default user agent string
Inherited from
WebCrawlerDataSourceAssociationProps.userAgent
userAgentHeader?
readonly optional userAgentHeader?: string
The user agent header to use when crawling: a string that identifies the crawler or bot when it accesses a web server. The header value consists of bedrockbot, a UUID, and a user agent suffix for your crawler (if one is provided). By default, it is set to bedrockbot_UUID. You can optionally append a custom suffix to bedrockbot_UUID to allowlist a specific user agent permitted to access your source URLs.
Default
- Default user agent header (bedrockbot_UUID)