
Indexed Search

The TYPO3 Crawler is quite often used to regenerate the index of Indexed Search.

Frontend indexing setup

Here we will configure indexed_search to automatically index pages when they are visited by users in the frontend.

  1. Make sure you do not have config.no_cache set in your TypoScript configuration, as it prevents indexing (a quick TypoScript check is sketched below).
  2. Admin Tools > Settings > Extension configuration > indexed_search: Make sure "Disable Indexing in Frontend" is disabled (thus frontend indexing is enabled).
  3. Web > List: In your site root, create a new "Indexing Configuration" record.

    • Type: Page tree
    • Depth: 4
    • Access > Enable: Activate it

    Save.

  4. Edit the page settings of a visible page and make sure that Behaviour > Miscellaneous > Include in search is activated.
  5. View this page in the frontend.
  6. Web > Indexing > Detailed statistics: The page you just visited is now shown as "Indexed", with Filename, Filesize and indexing timestamp.

If this did not work, flush the frontend caches (or simply all caches) and try again. Getting frontend indexing to work is crucial for the rest of this How-To.
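
As a quick check for step 1, you can look at the TypoScript setup of your site root template. The following is a minimal sketch, assuming a plain site package without further cache tweaks; config.index_enable is the TypoScript switch that indexed_search evaluates before it writes index records during frontend rendering:

    # Must not be enabled: a page that is not cached cannot be indexed.
    config.no_cache = 0

    # Enable indexing of cached pages (and, if desired, of external files).
    config.index_enable = 1
    config.index_externals = 1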

Crawler setup

Now that frontend indexing works, it's time to configure the crawler to re-index all pages instead of relying on visitors to trigger indexing:

  1. Admin Tools > Settings > Extension configuration > indexed_search: Enable "Disable Indexing in Frontend", so that indexing only happens through the crawler.
  2. Web > List: In your site root, create a new "Crawler configuration" record (an equivalent Page TSconfig sketch follows below).

    • Name: crawl-mysite
    • Processing instruction filter: Enable "Re-Indexing [tx_indexedsearch_reindex]"

    Save.

  3. Do a manual crawl on command line. "23" is the site root page UID:

    $ ./vendor/bin/typo3 crawler:buildQueue 23 crawl-mysite --depth 2 --mode exec -vvv
    
    Executing 2 requests right away:
    [19-08-25 14:13] http://example.org/ (URL already existed)
    [19-08-25 14:13] http://example.org/faq (URL already existed)
    <warning>Internal:  (Because page is hidden)</warning>
    <warning>Tools:  (Because doktype "254" is not allowed)</warning>
    Processing
    
    http://example.org/ (tx_indexedsearch_reindex) =>
    
    OK:
       User Groups:
    
    http://example.org/faq (tx_indexedsearch_reindex) =>
    
    OK:
       User Groups:
    
    2/2 [============================] 100%  1 sec/1 sec  42.0 MiB
  4. Web > Indexing: All pages should be indexed now.
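
As an alternative to the "Crawler configuration" record from step 2, the crawler can also be configured via Page TSconfig on the site root page (see the Page TSconfig Reference, tx_crawler.crawlerCfg, elsewhere in this manual). The following is a rough sketch with the same intent as the record above; the key name crawl-mysite is only an example, and you should check the Page TSconfig reference for the exact option names before relying on it:

    # GET parameter variations (e.g. &L=[0-1]) would go into the value;
    # for a plain re-index of every page it can stay empty.
    tx_crawler.crawlerCfg.paramSets.crawl-mysite =
    tx_crawler.crawlerCfg.paramSets.crawl-mysite {
        # trigger the indexed_search re-indexing processing instruction
        procInstrFilter = tx_indexedsearch_reindex
    }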

Nightly crawls

We want the crawler to run automatically at night:

  1. Create the first scheduler task, which builds the queue of page URLs that the second task will crawl.

    System > Scheduler > +:

    • "Class" is "Execute console commands"
    • "Frequency" is every night at 2 o'clock: 0 2 * * *
    • "Schedulable Command" must be "crawler:buildQueue"

    Save and continue editing:

    • "Argument: page" must be the UID of the site root page (23)
    • "Argument: conf" is crawl-mysite
    • "Option: depth" must be enabled and set to 99

    Save.

  2. Run the task manually, either via the scheduler module in the backend or via command line:

    $ ./vendor/bin/typo3 scheduler:run --task=1 -f -vv
    Task #1 was executed
    

    (1 is the scheduler task ID)

  3. Verify that the pages have been queued by looking at Web > Info > Site Crawler > Crawler log > 2 levels. The pages have a timestamp in the "Scheduled" column.
  4. Create the second scheduler task that will index all the page URLs queued by the first task:

    System > Scheduler > +:

    • "Class" is "Execute console commands"
    • "Frequency" is every 10 minutes: */10 * * * *
    • "Schedulable Command" must be "crawler:processQueue"

    Save and continue editing:

    • "Option: amount" should be 50, or any value that the system is able to process within the 10 minutes.

    Save.

  5. Run the task manually, again via the scheduler module in the backend (only if it is a small site!) or via command line:

    $ ./vendor/bin/typo3 scheduler:run --task=2 -f -vv
    Task #2 was executed
    

    This crawl task will run much longer than the queue-building task.

  6. Verify that the pages have been indexed by looking at Web > Indexing. All queued pages should have an index date now.

    Web > Info > Site Crawler > Crawler log > 2 levels should show a timestamp in the "Run-time" column, as well as OK in the "Status" column.

This completes the basic crawler setup. Every night at 2:00, all pages will be re-indexed in batches of 50.
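
If you prefer plain cron jobs over scheduler tasks, the same two steps can be run directly from the system crontab using the commands shown above. A minimal sketch, assuming a Composer-based installation in /var/www/mysite and the example values from this page (site root UID 23, configuration crawl-mysite):

    # Build the queue every night at 2:00 (same as the first scheduler task).
    0 2 * * *    cd /var/www/mysite && ./vendor/bin/typo3 crawler:buildQueue 23 crawl-mysite --depth 99 --mode queue

    # Process the queue in batches of 50 every 10 minutes (same as the second task).
    */10 * * * * cd /var/www/mysite && ./vendor/bin/typo3 crawler:processQueue --amount 50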
