Site Crawler Extension 

Extension key

crawler

Package name

tomasnorre/crawler

Version

main

Language

en

Author

Tomas Norre Mikkelsen

Copyright

2005-2021 AOE GmbH, since 2021 Tomas Norre Mikkelsen

License

This document is published under the Open Content License.

Rendered

Tue, 04 Nov 2025 17:45:32 +0000


Libraries and scripts for crawling the TYPO3 page tree. Used for re-caching, re-indexing, publishing applications etc.


Table of Contents:

Introduction 

What does it do? 

The TYPO3 Crawler is an extension that can be used from both the TYPO3 backend and the command line (CLI) to help you keep your cache and, for example, your search index up to date.

The Crawler provides several PSR-14 events that you can listen to ("hook" into) if your site has specific requirements.

See ModifySkipPageEvent for an example.

It features an API that other extensions can plug into. An example of this is "indexed_search", which uses the crawler to index content defined by its Indexing Configurations. Other extensions supporting it are "staticpub" (publishing to static pages) and "cachemgm" (re-caching of pages).

The URL requests are specially designed to target TYPO3 frontends with special processing instructions. Each GET request sends a TYPO3-specific header which identifies a special action: for instance, the requested action could be to publish the URL to a static file, to index its content, or to re-cache the page. These processing instructions are also defined by third-party extensions (indexed_search is one of them). In this way a processing instruction can instruct the frontend to perform an action (like indexing or publishing) which cannot be done with a request from outside.
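Conceptually, such a request is just a normal GET request with one extra header. The following is only a sketch: the hostname is a placeholder, and the real header value is a payload generated by the crawler for each queue entry, so a dummy value will not trigger any processing instruction.

# Sketch: inspect how a frontend reacts to a request carrying the crawler header.
# The X-T3Crawler value shown here is a placeholder, not a valid crawler token.
curl -s -D - -o /dev/null -H 'X-T3Crawler: <queue-entry-payload>' https://www.example.com/some/page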

Screenshots 

The extension provides a backend module which displays the queue and log and allows execution and status check of the "cronscript" from the backend for testing purposes.

CLI status display

CLI = Command Line Interface = shell script = cron script

Crawler queue (before processing) / log (after processing)

Interface for submitting a batch of URLs to be crawled

The parameter combinations are programmable through Page TSconfig or configuration records.

Extension Manager Configuration 

A number of options are available in the extension manager configuration that let you tune the crawler and enable additional features:

Backend configuration: Settings

Backend configuration: Queue

Configuration records 

Formerly, configuration was done using pageTS (see below). This is still possible (fully backwards compatible) but not recommended. Instead of writing pageTS, simply create a configuration record (table: tx_crawler_configuration) and put it on the topmost page of the page tree you want to affect with this configuration.

The fields in these records are related to the pageTS keys described below.

Fields and their pageTS equivalents 

General 

Backend configuration record: General

Name
Corresponds to the "key" part in the pageTS setup e.g. tx_crawler.crawlerCfg.paramSets.myConfigurationKeyName
Protocol for crawling
Force HTTP, HTTPS or keep the configured protocol
Processing instruction filter
List of processing instructions. See also: paramSets.[key].procInstrFilter
Base URL
Set baseUrl (most likely the same as the entry point configured in your site configuration)
Pids only
List of Page Ids to limit this configuration to. See also: paramSets.[key].pidsOnly
Exclude pages
Comma-separated list of page ids which should not be crawled. You can exclude pages recursively by adding uid+depth, e.g. 6+3: this ensures that page uid 6 and all pages up to 3 levels below it are not crawled.
Configuration
Parameter configuration. The values of GET variables are according to a special syntax. See also: paramSets.[key]

Processing instruction parameters
Options for processing instructions. Will be defined in the respective third party modules. See also: paramSets.[key].procInstrParams

Crawl with FE user groups
User groups to set for the request. See also: paramSets.[key].userGroups and the hint in Cache warm up

Access 

Backend configuration record: Access

Hide
If activated the configuration record is not taken into account.
Restrict access to
Restricts access to this configuration record to selected backend user groups. Empty means no restriction is set.

Page TSconfig Reference (tx_crawler.crawlerCfg) 

paramSets.[key]

paramSets.[key]
Type
string

GET parameter configuration. The values of GET variables are according to a special syntax. From the code documentation (class.tx_crawler_lib.php):

  • Basically: if the value is wrapped in [...] it will be expanded according to the following syntax, otherwise the value is taken literally.
  • The configuration is split by "|" and the parts are processed individually and finally added together.
  • For each configuration part:
    - "[int]-[int]" = integer range, expanded to all values in between (values included), from low to high (max. 1000). Example: "1-34" or "-40--30".
    - "_TABLE:" at the beginning of the string indicates a lookup in a table. The syntax is a string in which [keyword]:[value] pairs are separated by semicolons. Example: "_TABLE:tt_content; _PID:123"
      - Keyword "_TABLE" (mandatory, starting string): value is the table name from TCA to look up.
      - Keyword "_ADDTABLE": additional tables to fetch data from. This value is appended to "_TABLE" and used as the "FROM" part of the SQL query.
      - Keyword "_PID": value is an optional page id to look in (default is the current page).
      - Keyword "_RECURSIVE": optional flag to set the recursive crawl depth. Default is 0.
      - Keyword "_FIELD": value is the field name to use for the value (default is uid).
      - Keyword "_PIDFIELD": optional value containing the name of the column that holds the pid. By default this is "pid".
      - Keyword "_ENABLELANG": optional flag. If set, only records of the current language are fetched.
      - Keyword "_WHERE": optional WHERE condition, e.g. if you do not want hidden records to be crawled.
    - Default: the value is taken literally.

Examples:

&L=[|1|2|3]

&L=[0-3]
packages/my_extension/Configuration/Sets/MySet/page.tsconfig
tx_crawler.crawlerCfg.paramSets {
    myConfigurationKeyName = &tx_myext[items]=[_TABLE:tt_myext_items;_PID:15;_WHERE: hidden = 0]
    myConfigurationKeyName {
        pidsOnly = 13
        procInstrFilter = tx_indexedsearch_reindex
    }
}

paramSets.[key].procInstrFilter

paramSets.[key].procInstrFilter
Type
string

List of processing instructions, e.g. "tx_indexedsearch_reindex" from indexed_search, to send with the request. Processing instructions are necessary for the request to perform any meaningful action, since they activate third-party functionality.

paramSets.[key].procInstrParams.[procIn.key].[...]

paramSets.[key].procInstrParams.[procIn.key].[...]
Type
strings

Options for processing instructions. Will be defined in the respective third party modules.

Examples:

procInstrParams.tx_staticpub_publish.includeResources=1

paramSets.[key].pidsOnly

paramSets.[key].pidsOnly
Type
list of integers (pages uid)

List of Page Ids to limit this configuration to

paramSets.[key].force_ssl

paramSets.[key].force_ssl
Type
integer

Whether https should be enforced or not. 0 = false (default), 1 = true.

paramSets.[key].userGroups

paramSets.[key].userGroups
Type
list of integers (fe_groups uid)

User groups to set for the request.

paramSets.[key].baseUrl

paramSets.[key].baseUrl
Type
string

If not set, GeneralUtility::getIndpEnv('TYPO3_SITE_URL') is used to request the page.

MUST BE SET if run from CLI (since TYPO3_SITE_URL does not exist in that context!)

[Page TSconfig: tx_crawler.crawlerCfg]

Example 

packages/my_extension/Configuration/Sets/MySet/page.tsconfig
tx_crawler.crawlerCfg.paramSets.test = &L=[0-3]
tx_crawler.crawlerCfg.paramSets.test {
    procInstrFilter = tx_indexedsearch_reindex
    pidsOnly = 1,5,13,55
    userGroups = 1
    force_ssl = 1
}

HTTP Authentication 

If you want to use HTTP Authentication you need to configure your base url to contain user:pass

https://user:pass@www.mydomain.com/
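You can verify that the credentials are accepted before handing the base URL to the crawler; a quick sketch using the placeholder credentials and domain from above (expect 200 rather than 401):

curl -s -o /dev/null -w '%{http_code}\n' https://user:pass@www.mydomain.com/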

EXT:news 

The news extension is one of the most used extensions for TYPO3 CMS. This configuration assumes a page tree looking similar to this:

Example Pagetree of EXT:news setup

If you want a Crawler Configuration that matches this, you can add the following to the page TSconfig of page 56.

packages/my_extension/Configuration/Sets/MySet/page.tsconfig
tx_crawler.crawlerCfg.paramSets {
    tx_news = &tx_news_pi1[controller]=News&tx_news_pi1[action]=detail&tx_news_pi1[news]=[_TABLE:tx_news_domain_model_news; _PID:58; _WHERE: hidden = 0]
    tx_news {
        pidsOnly = 57
    }
}

# _PID:58 is the folder where news records are stored.
# pidsOnly = 57 is the detail-view PageId.

Now you can add the News detail-view pages to the crawler queue and have them in the cache and the indexed_search index if you are using that.

Respecting Categories in News 

On some installations news is configured in such a way that news of category A have their detail view on one page and news of category B have their detail view on another page. In this case it would still be possible to view news of category A on the detail page for category B (example.com/detail-page-for-category-B/news-of-category-A). That means that each news article would be crawled twice: once on the detail page for category A and once on the detail page for category B. It is possible to use a PSR-14 event provided by news to prevent this.

On both detail pages, include this TypoScript setup:

packages/my_extension/Configuration/Sets/MySet/setup.typoscript
plugin.tx_news.settings {
    # categories and categoryconjunction are not considered in detail view, so they must be overridden
    overrideFlexformSettingsIfEmpty = cropMaxCharacters,dateField,timeRestriction,archiveRestriction,orderBy,orderDirection,backPid,listPid,startingpoint,recursive,list.paginate.itemsPerPage,list.paginate.templatePath,categories,categoryConjunction
    # see the news extension for possible values of categoryConjunction
    categoryConjunction = AND
    categories = <ID of respective category>
    detail.errorHandling = pageNotFoundHandler
}

and register an event listener in your site package.

packages/my_extension/Configuration/Services.yaml
services:
  MyVendor\MyExtension\EventListeners\NewsDetailEventListener:
    tags:
      - name: event.listener
        identifier: 'myNewsDetailListener'
        event: GeorgRinger\News\Event\NewsDetailActionEvent
packages/my_extension/Classes/EventListeners/NewsDetailEventListener.php
<?php

declare(strict_types=1);

namespace MyVendor\MyExtension\EventListeners;

use GeorgRinger\News\Event\NewsDetailActionEvent;

class NewsDetailEventListener
{
    public function __invoke(NewsDetailActionEvent $event): void
    {
        $assignedValues = $event->getAssignedValues();
        $newsItem = $assignedValues['newsItem'];
        $demand = $assignedValues['demand'];
        $settings = $assignedValues['settings'];

        if ($newsItem !== null) {
            $demandedCategories = $demand->getCategories();
            $itemCategories = $newsItem->getCategories()->toArray();
            $itemCategoryIds = \array_map(function ($category) {
                return (string) $category->getUid();
            }, $itemCategories);

            if (count($demandedCategories) > 0 && !$this::itemMatchesCategoryDemand(
                $settings['categoryConjunction'],
                $itemCategoryIds,
                $demandedCategories
            )) {
                $assignedValues['newsItem'] = null;
                $event->setAssignedValues($assignedValues);
            }
        }
    }

    protected static function itemMatchesCategoryDemand(
        string $categoryConjunction,
        array $itemCategoryIds,
        array $demandedCategories
    ): bool {
        $numOfDemandedCategories = \count($demandedCategories);
        $intersection = \array_intersect($itemCategoryIds, $demandedCategories);
        $numOfCommonItems = \count($intersection);

        switch ($categoryConjunction) {
            case 'AND':
                return $numOfCommonItems === $numOfDemandedCategories;
            case 'OR':
                return $numOfCommonItems > 0;
            case 'NOTAND':
                return $numOfCommonItems < $numOfDemandedCategories;
            case 'NOTOR':
                return $numOfCommonItems === 0;
        }
        return true;
    }
}

Run via command controller 

Create queue 

replace vendor/bin/typo3 with your own cli runner
$ vendor/bin/typo3 crawler:buildQueue <page-id> <configurationKey1,configurationKey2,...> [--depth <depth>] [--number <number>] [--mode <exec|queue|url>]

Run queue 

replace vendor/bin/typo3 with your own cli runner
$ vendor/bin/typo3 crawler:processQueue [--amount <pages to crawl>] [--sleeptime <milliseconds>] [--sleepafter <seconds>]
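For example, a throttled run that stops after a limited number of pages could look like this (the values are only illustrative):

$ vendor/bin/typo3 crawler:processQueue --amount 50 --sleeptime 200 --sleepafter 10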

Flush queue 

replace vendor/bin/typo3 with your own cli runner
$ vendor/bin/typo3 crawler:flushQueue <pending|finished|all>
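For example, to drop only the entries that have not been processed yet:

$ vendor/bin/typo3 crawler:flushQueue pending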

Executing queue with cron-job 

A "cron-job" refers to a script that runs on the server with time intervals.

For this to become reality you must ideally have a cron-job set up. This assumes you are running on Unix architecture of some sort. The crontab is often edited by crontab -e and you should insert a line like this:

* * * * * vendor/bin/typo3 crawler:buildQueue <startpage> <configurationKeys> > /dev/null

This will run the script every minute. You should first try to run the script on the command line to make sure it runs without any errors. If it doesn't output anything, it was successful.
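If you also want the queue to be processed by cron rather than by the scheduler, you can add a second line in the same way (the path is a placeholder; adjust it to your installation):

* * * * * cd /path/to/your/typo3 && vendor/bin/typo3 crawler:processQueue > /dev/null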

You will need a backend user called _cli_ and a PHP CLI binary available on the server (for example in /usr/bin/).

The user _cli_ is created automatically on the first command line call if it does not exist.

Make sure that the user _cli_ has admin rights.

In the CLI status menu of the Site Crawler info module you can see the status:

Status page in the backend

This is how it looks just after you ran the script. (You can also see the full path to the script at the bottom; this is the path to the script as you should use it on the command line / in the crontab.)

If the cron script stalls, there is a default delay of 1 hour before a new process declares the old one dead and takes over. If a cron script takes more than 1 minute and thereby overlaps the next process, the next process will NOT start if it sees that the "lock file" exists (unless that hour has passed).

It works like this to make sure that overlapping calls to the crawler CLI script do not run parallel processes. The second call simply exits if it finds in the status file that a process is already running. Of course a crashed script will fail to set the status to "end", and hence this situation can occur.

Run via backend 

To process the queue you must either set up a cron-job on your server or use the backend to process the queue:

Process the queue via backend

You can also (re-)crawl single URLs manually from within the Crawler log view in the info module:

Crawl single URLs via backend

Building and Executing queue right away (from cli) 

An alternative mode is to build and execute the queue from the command line in one process. This doesn't allow scheduling of task processing and consumes as much CPU as it can, but on the other hand the job is done right away: the queue is both built and executed immediately.

The script to use is this:

vendor/bin/typo3 crawler:buildQueue <startPageUid> <configurationKeys>

If you run it you will see a list of options which explains usage.

<startPageUid>

<startPageUid>
Type
integer

Page Id of the page to use as starting point for crawling.

<configurationKeys>

<configurationKeys>
Type
string

Comma-separated list of your crawler configurations. If you use crawler configuration records, use the record "name"; if you are still using the old TypoScript-based configuration, use the configuration key (which is also a string).

Examples:

re-crawl-pages,re-crawl-news

--number <number>

--number <number>
Type
integer

Specifies how many items are put in the queue per minute. Only valid for output mode "queue".

--mode <mode>

--mode <mode>
Type
string
Default
queue

Output mode: "url", "exec", "queue"

  • url : Will list URLs which wget could use as input.
  • queue: Will put entries in queue table.
  • exec: Will execute all entries right away!

--depth <depth>

--depth <depth>
Type
integer
Default
0

Tree depth, 0-99.

How many levels under the 'page_id' to include. By default, no additional levels are included.

Example 

We want to crawl pages under the page "Content Examples" (uid=6) and 2 levels down, with the default crawler configuration.

This is done like this in the backend.

To do the same with the CLI script you run this:

vendor/bin/typo3 crawler:buildQueue 6 default --depth 2

And this is the output:

38 entries found for processing. (Use "mode" to decide action):

[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/overview
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/text/rich-text
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/text/headers
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/text/bullet-list
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/text/text-with-teaser
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/text/text-and-icon
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/text/text-in-columns
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/text/list-group
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/text/panel
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/text/table
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/text/quote
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/media/audio
[10-04-20 10:35] https://crawler-devbox.ddev.site/content-examples/media/text-and-images
...
[10-04-20 10:36] https://crawler-devbox.ddev.site/content-examples/and-more/frames

At this point you have three options for "action":

  • Commit the URLs to the queue and let the cron script take care of them over time. There is an option for setting the number of tasks per minute if you wish to change it from the default of 30 (see the sketch after this list). This is useful if you would like to submit a job to the cron-script-based crawler every day.

    • Add "--mode queue"
    • This is also the default, so unless you want it to be explicitly visible, you don't need to add it.
  • List full URLs for use with wget or similar. Corresponds to pressing the "Download URLs" button in the backend module.

    • Add "--mode url"
    $ bin/typo3 crawler:buildQueue 6 default --depth 2 --mode url
    https://crawler-devbox.ddev.site/content-examples/overview
    https://crawler-devbox.ddev.site/content-examples/text/rich-text
    https://crawler-devbox.ddev.site/content-examples/text/headers
    https://crawler-devbox.ddev.site/content-examples/text/bullet-list
    https://crawler-devbox.ddev.site/content-examples/text/text-with-teaser
    https://crawler-devbox.ddev.site/content-examples/text/text-and-icon
    https://crawler-devbox.ddev.site/content-examples/text/text-in-columns
    https://crawler-devbox.ddev.site/content-examples/text/list-group
    https://crawler-devbox.ddev.site/content-examples/text/panel
    
  • Commit and execute the queue right away. This will still put the jobs into the queue but execute them immediately. If server load is no issue for you and you are in a hurry, this is the way to go! It also feels much more like the command-line way of doing things, and the status output is more immediate than with the queue.

    • Add "--mode exec"
    $ bin/typo3 crawler:buildQueue 6 default --depth 2 --mode exec
    https://crawler-devbox.ddev.site/content-examples/overview
    https://crawler-devbox.ddev.site/content-examples/text/rich-text
    https://crawler-devbox.ddev.site/content-examples/text/headers
    https://crawler-devbox.ddev.site/content-examples/text/bullet-list
    https://crawler-devbox.ddev.site/content-examples/text/text-with-teaser
    https://crawler-devbox.ddev.site/content-examples/text/text-and-icon
    https://crawler-devbox.ddev.site/content-examples/text/text-in-columns
    https://crawler-devbox.ddev.site/content-examples/text/list-group
    https://crawler-devbox.ddev.site/content-examples/text/panel
    ...
    Processing
    
    https://crawler-devbox.ddev.site/content-examples/overview () =>
    
    OK:
            User Groups:
    
    https://crawler-devbox.ddev.site/content-examples/text/rich-text () =>
    
    OK:
            User Groups:
    
    https://crawler-devbox.ddev.site/content-examples/text/headers () =>
    
    OK:
            User Groups:
    
    https://crawler-devbox.ddev.site/content-examples/text/bullet-list () =>
    
    OK:
            User Groups:
    ...
    
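For the queue mode mentioned in the first option, the throughput can be raised from the default 30 items per minute with --number; a sketch (the value 60 is only an example):

$ bin/typo3 crawler:buildQueue 6 default --depth 2 --mode queue --number 60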

Scheduler 

As seen in Executing the queue you can execute the queue in multiple ways, but it's no fun doing that manually all the time.

With the Crawler you can add scheduler tasks to be executed at a given time. The Crawler commands are implemented with the Symfony Console and can therefore be configured with the core-supported Execute console commands (scheduler) task.

To set up crawler scheduler tasks:

  1. Add a new Scheduler Task
  2. Select the class Execute console commands
  3. Select Frequency for the execution
  4. Go to section Schedulable Command. Save and reopen to define command arguments at the bottom.
  5. Select e.g. crawler:buildQueue (press save)
  6. Select the options you want to execute the queue with; it is important to check the checkboxes and not only fill in the values.

Now you can save and close, and your scheduler tasks will be running as configured.

The configured task will look like this:

Task configuration for building the queue

After save and close you can see which command is executed; it takes the same parameters you would use when running from the CLI, see Building and Executing queue right away (from cli).

Task in the scheduled tasks overview

Use cases 

This section shows different use cases for the crawler and the value it can bring. The crawler has evolved over the years to cover multiple use cases. If you have one that is not listed here, feel free to open a PR or an issue on https://github.com/tomasnorre/crawler.

Cache warm up 

A website that is fast for the end user is essential, so having a warm cache even before the first user hits the newly deployed website is beneficial. How can this be achieved?

The crawler provides command line tools (hereafter CLI tools) that can be used during deployments. The CLI tools are implemented with symfony/console, which has been standard in TYPO3 for a while.

There are three commands that can be of benefit during deployments:

  • vendor/bin/typo3 crawler:flushQueue
  • vendor/bin/typo3 crawler:buildQueue
  • vendor/bin/typo3 crawler:processQueue

You can read more about the parameters they take in Run via command controller. This example provides a suggestion on how to set it up; adjust it with additional parameters if you like.

  1. Create crawler configuration

    First we need a crawler configuration; these are stored in the database. You can add one via the backend, see Configuration records.

    It is suggested to select the most important pages of the website and add them to a crawler configuration called e.g. deployment:

    Crawler configuration record

  2. Build the queue

    With this, only the pages added will be crawled when using this configuration. So how do we execute this from the CLI during deployment? Which deployment tool you use is not important, as long as it can execute shell commands. What would you need to execute?

    # Done to make sure the crawler queue is empty, so that we will only crawl important pages.
    $ vendor/bin/typo3 crawler:flushQueue all
    
    # Now we want to fill the crawler queue.
    # This will start on page uid 1 with the deployment configuration and depth 99;
    # --mode exec crawls the pages instantly, so we don't need a secondary process for that.
    $ vendor/bin/typo3 crawler:buildQueue 1 deployment --depth 99 --mode exec

    # Add the rest of the pages to the crawler queue and have them processed with the scheduler.
    # --mode queue is the default, but it is added here for visibility;
    # we assume that you have a crawler configuration called default.
    $ vendor/bin/typo3 crawler:buildQueue 1 default --depth 99 --mode queue
    
  3. Process the queue

    The last step added the pages to the queue; you now need a scheduler task set up to have them processed. Go to the Scheduler module and follow these steps:

    1. Add a new Scheduler Task
    2. Select the Execute console commands
    3. Select Frequency for the execution
    4. Go to section Schedulable Command. Save and reopen to define command arguments at the bottom.
    5. Select crawler:processQueue (press save)
    6. Select the options you want to execute the queue with; it is important to check the checkboxes and not only fill in the values.
    Options of the task

With these steps you will have a website that is fast from the first visit after a deployment, and the rest of the website is crawled automatically shortly after.

#HappyCrawling

Automatic add pages to Queue 

New in version 9.1.0

Edit Pages 

With this feature, pages are automatically added to the crawler queue when you edit content on them. If the change is made within a workspace, the page is not added to the queue until it is published.

This gives you the advantage that you do not need to keep track of which pages you have edited; they are handled automatically on the next crawler process run, see Executing the queue. This ensures that your cache or, for example, your search index is always up to date and that end users see the most current content as soon as possible.

Clear Page Single Cache 

As editing and clearing the page cache use the same DataHandler hooks, we get an additional feature for free: when you clear the page cache for a specific page, that page is also added automatically to the crawler queue. Again, this is processed during the next crawler process run.

Clearing the page cache

Page is added to the crawler queue

Pollable processing instructions 

Some processing instructions are never executed on the "client side" (the TYPO3 frontend that is called by the crawler). This happens, for example, when staticpub tries to publish a page containing non-cacheable elements. The bad thing about this is that staticpub has no chance to report that something went wrong and why. That is why the "pollable processing instructions" feature was introduced. You can declare in the ext_localconf.php file of your extension that the extension should be "pollable" by adding the following line:

packages/my_extension/ext_localconf.php
$GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pollSuccess'][] = 'tx_staticpub';

In this case the crawler expects the extension to actively report that everything was OK; if no "success message" is found, the crawler assumes that something went wrong (and displays this in the log).

In your extension, simply set your "OK" status during the frontend request by calling:

packages/my_extension/ext_localconf.php
$GLOBALS['TSFE']->applicationData['tx_crawler']['success']['tx_staticpub'] = true;

Multi process support 

If you want to optimize the crawling process for speed (instead of low server load), perhaps because the machine is a dedicated staging machine, you should experiment with the multi-process features.

In the extension settings you can set how many processes are allowed to run at the same time, how many queue entries a process should grab and how long a process is allowed to run. Then run one (or even more) crawling processes per minute. You'll be able to speed up the crawler quite a lot.

But choose your settings carefully, as this puts load on the server.

Backend configuration: Processing

Hooks 

Register the following hooks in ext_localconf.php of your extension.

excludeDoktype Hook 

By adding doktype ids to the following array you can exclude pages of those types from being crawled:

packages/my_extension/ext_localconf.php
$GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['excludeDoktype'][] = <dokTypeId>;

pageVeto Hook 

Deprecated since version 11.0.0

You can also decide in an individual user function whether a page should be crawled or not. Register your function like this:

packages/my_extension/ext_localconf.php
$GLOBALS['TYPO3_CONF_VARS']['EXTCONF']['crawler']['pageVeto'][] = MyVendor\MyExtension\Hooks\Crawler\PageVeto::class . '->excludePage';
packages/my_extension/Classes/Hooks/Crawler/PageVeto.php
<?php

declare(strict_types=1);

namespace MyVendor\MyExtension\Hooks\Crawler;

use AOE\Crawler\Controller\CrawlerController;

class PageVeto
{
    public function excludePage(array &$params, CrawlerController $controller)
    {
        if ($params['pageRow']['uid'] === 42) {
            return 'Page with uid "42" is excluded by page veto hook';
        }

        return false;
    }
}

PSR-14 Events 

New in version 11.0.0

You can register your own PSR-14 event listeners and extend the functionality of the TYPO3 Crawler. In this section you will see which events you can listen to.

ModifySkipPageEvent 

With this event you can implement your own logic for deciding whether a page should be skipped. This can be a simple skip by uid, as in the example below, or more complex logic.

Let's say you don't want to crawl pages with an SEO priority of 0.2 or lower. This would then be the place to add your own listener to modify the skip-page logic already implemented.

  1. Create the event listener

    packages/my_extension/Classes/EventListener/ModifySkipPageEventListener.php
    <?php
    
    declare(strict_types=1);
    
    namespace MyVendor\MyExtension\EventListener;
    
    use AOE\Crawler\Event\ModifySkipPageEvent;
    
    final class ModifySkipPageEventListener
    {
        public function __invoke(ModifySkipPageEvent $modifySkipPageEvent)
        {
            if ($modifySkipPageEvent->getPageRow()['uid'] === 42) {
                $modifySkipPageEvent->setSkipped('Page with uid "42" is excluded by ModifySkipPageEvent');
            }
        }
    }
    
  2. Register your event listener in Configuration/Services.yaml

    packages/my_extension/Configuration/Services.yaml
    services:
      MyVendor\MyExtension\EventListener\ModifySkipPageEventListener:
        tags:
          -   name: event.listener
              identifier: 'ext-extension-key/ModifySkipPageEventListener'
              event: AOE\Crawler\Event\ModifySkipPageEvent
    

AfterUrlCrawledEvent 

This event enables you to trigger, for example, a Varnish ban for a specific URL right after it has been crawled. This ensures that your Varnish cache is kept up to date as well.

  1. Create the event listener

    packages/my_extension/Classes/EventListener/AfterUrlCrawledEventListener.php
    <?php
    
    declare(strict_types=1);
    
    namespace MyVendor\MyExtension\EventListener;
    
    use AOE\Crawler\Event\AfterUrlCrawledEvent;
    
    final class AfterUrlCrawledEventListener
    {
        public function __invoke(AfterUrlCrawledEvent $afterUrlCrawledEvent)
        {
            // e.g. trigger a Varnish ban for the crawled URL; the helper and getter names
            // here are illustrative, adjust them to your own Varnish integration:
            // varnishBanUrl($afterUrlCrawledEvent->getUrl());
        }
    }
    
  2. Register your event listener in Configuration/Services.yaml

    packages/my_extension/Configuration/Services.yaml
    services:
      MyVendor\MyExtension\EventListener\AfterUrlCrawledEventListener:
        tags:
          -   name: event.listener
              identifier: 'ext-extension-key/AfterUrlCrawledEventListener'
              event: AOE\Crawler\Event\AfterUrlCrawledEvent
    

InvokeQueueChangeEvent 

The InvokeQueueChangeEvent enables you to act on queue changes, for example by automatically adding new processes. The event takes a reason as argument, which gives you more information about what has happened and, for the GUI, also by whom.

  1. Create the event listener

    packages/my_extension/Classes/EventListener/InvokeQueueChangeEventListener.php
    <?php
    
    declare(strict_types=1);
    
    namespace MyVendor\MyExtension\EventListener;
    
    use AOE\Crawler\Event\InvokeQueueChangeEvent;
    
    final class InvokeQueueChangeEventListener
    {
        public function __invoke(InvokeQueueChangeEvent $invokeQueueChangeEvent)
        {
            $reason = $invokeQueueChangeEvent->getReasonText();
            // You can implement different logic based on reason, GUI or CLI
        }
    }
    
  2. Register your event listener in Configuration/Services.yaml

    packages/my_extension/Configuration/Services.yaml
    services:
      MyVendor\MyExtension\EventListener\InvokeQueueChangeEventListener:
        tags:
          - name: event.listener
            identifier: 'ext-extension-key/InvokeQueueChangeEventListener'
            event: AOE\Crawler\Event\InvokeQueueChangeEvent
    

AfterUrlAddedToQueueEvent 

AfterUrlAddedToQueueEvent gives you the opportunity to trigger desired actions based on, for example, which fields were changed. You have the uid and fieldArray available for evaluation.

  1. Create the event listener

    packages/my_extension/Classes/EventListener/AfterUrlAddedToQueueEventListener.php
    <?php
    
    declare(strict_types=1);
    
    namespace MyVendor\MyExtension\EventListener;
    
    use AOE\Crawler\Event\AfterUrlAddedToQueueEvent;
    
    final class AfterUrlAddedToQueueEventListener
    {
        public function __invoke(AfterUrlAddedToQueueEvent $afterUrlAddedToQueueEvent): void
        {
            // Implement your wanted logic, you have the `$uid` and `$fieldArray` information
        }
    }
    
  2. Register your event listener in Configuration/Services.yaml

    packages/my_extension/Configuration/Services.yaml
    services:
      MyVendor\MyExtension\EventListener\AfterUrlAddedToQueueEventListener:
        tags:
          -   name: event.listener
              identifier: 'ext-extension-key/AfterUrlAddedToQueueEventListener'
              event: AOE\Crawler\Event\AfterUrlAddedToQueueEvent
    

BeforeQueueItemAddedEvent 

This event can be used to check or modify a queue record before it is added to the queue. This can be useful if you want certain actions in place based on, say, doktype or SEO priority.

  1. Create the event listener

    packages/my_extension/Classes/EventListener/BeforeQueueItemAddedEventListener.php
    <?php
    
    declare(strict_types=1);
    
    namespace MyVendor\MyExtension\EventListener;
    
    use AOE\Crawler\Event\BeforeQueueItemAddedEvent;
    
    final class BeforeQueueItemAddedEventListener
    {
        public function __invoke(BeforeQueueItemAddedEvent $beforeQueueItemAddedEvent)
        {
            // Implement your wanted logic, you have the `$queueId` and `$queueRecord` information
        }
    }
    
  2. Register your event listener in Configuration/Services.yaml

    packages/my_extension/Configuration/Services.yaml
    services:
      MyVendor\MyExtension\EventListener\BeforeQueueItemAddedEventListener:
        tags:
          -   name: event.listener
              identifier: 'ext-extension-key/BeforeQueueItemAddedEventListener'
              event: AOE\Crawler\Event\BeforeQueueItemAddedEvent
    

AfterQueueItemAddedEvent 

The AfterQueueItemAddedEvent can be helpful if you want a given action performed after the item is added. Here you have the queueId and fieldArray information available for your own usage and checks.

  1. Create the event listener

    packages/my_extension/Classes/EventListener/AfterQueueItemAddedEventListener.php
    <?php
    
    declare(strict_types=1);
    
    namespace MyVendor\MyExtension\EventListener;
    
    use AOE\Crawler\Event\AfterQueueItemAddedEvent;
    
    final class AfterQueueItemAddedEventListener
    {
        public function __invoke(AfterQueueItemAddedEvent $afterQueueItemAddedEvent)
        {
            // Implement your wanted logic, you have the `$queueId` and `$fieldArray` information
        }
    }
    
  2. Register your event listener in Configuration/Services.yaml

    packages/my_extension/Configuration/Services.yaml
    services:
      MyVendor\MyExtension\EventListener\AfterQueueItemAddedEventListener:
        tags:
          -   name: event.listener
              identifier: 'ext-extension-key/AfterQueueItemAddedEventListener'
              event: AOE\Crawler\Event\AfterQueueItemAddedEvent
    

Priority Crawling 

New in version 9.1.0

Some websites have quite a large number of pages. Some pages are logically more important than others, e.g. the start, support or product pages. These important pages are also the pages where we want the best caching and performance, as they will most likely be the pages with the most changes and the most traffic.

With TYPO3 10 LTS, sysext/seo introduced, among other things, the sitemap_priority field, which is used to generate an SEO-optimised sitemap.xml in which page priorities are listed as well. The priority will most likely be higher the more important the page is for you and the end user.

This logic is something the Crawler can benefit from as well. On a website with, let us say, 10,000 pages, the importance differs from page to page. Therefore the crawler takes the value of this field, ranging from 0.0 to 1.0, into consideration when processing the crawler queue. This means that a page with a high sitemap priority is also crawled first when a new crawler process is started.

This ensures that the pages with the highest importance to you and your end users, based on your sitemap priority, are always crawled first. We chose to reuse this field so that editors do not have to do more or less the same work twice.

If you don't want to use this functionality, that is fine. Just ignore the options that sysext/seo gives you; all pages then get the default priority of 0.5 and the processing order is not influenced, as every page has the same priority.

The existing SEO tab will be used to set priorities when editing pages.

The SEO tab will contain the sitemap_priority field

Troubleshooting 

Problem reading data in Crawler Queue 

With crawler release 9.1.0, the data stored in the crawler queue was changed from serialized data to JSON. If you are experiencing problems because old data is still in your database, you can flush the complete crawler queue and the problem should be solved.

A JsonCompatibilityConverter is built in to ensure that this should not happen, but in case it does, run:

$ vendor/bin/typo3 crawler:flushQueue all


Make Direct Request doesn't work 

If you are using direct requests (see Extension Manager Configuration) and you don't get any results, or the scheduler task stalls, the cause can be a misconfigured trustedHostsPattern. This can be changed in LocalConfiguration.php:

$GLOBALS['TYPO3_CONF_VARS']['SYS']['trustedHostsPattern'] = '<your-pattern>';

Crawler won't process all entries from command line

The crawler won't process all entries when run from the command line. This can happen because PHP runs into a timeout; to avoid this you can call the crawler like this:

php -d max_execution_time=512 vendor/bin/typo3 crawler:buildQueue
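The same approach works for the processing run, which is usually the long-running command; for example:

php -d max_execution_time=512 vendor/bin/typo3 crawler:processQueue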

Crawler Count is 0 (zero) 

If you experience that only one URL is added to the crawler queue, you are probably on a new setup or on an update from TYPO3 8 LTS and might have a migration that has not been executed yet.

Please check the Upgrade Wizard and verify that Introduce URL parts ("slugs") to all existing pages is marked as done; if not, perform this step.

See related issue: [BUG] Crawling Depth not respected #464

Update from older versions 

If you update the extension from older versions you can run into the following error:

SQL error: 'Field 'sys_domain_base_url' doesn't have a default value'

Make sure to delete all unnecessary fields from the database tables. You can do this in the backend via the Analyze Database Structure tool or, if you have TYPO3 Console installed, via the command line with vendor/bin/typo3cms database:updateschema.
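With TYPO3 Console installed, the call from the command line looks like this:

$ vendor/bin/typo3cms database:updateschema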

TYPO3 shows error if the PHP path is not correct 

In some cases you get an error if the PHP path is not set correctly. It occurs when you select the Site Crawler in the Info module.

Error message in Info-module

In this case you have to set the path to your PHP in the Extension configuration.

Correct PHP path settings in Extension configuration

Please be sure to add the correct path to your PHP binary. The path in this screenshot might be different from your PHP path.

Info Module throws htmlspecialchars() expects parameter 1 to be string 

The Crawler has had a bug for a while which was difficult to figure out. The bug is caused by a problem with the CrawlerHook in the TYPO3 Core, which has been removed in TYPO3 11.

I will not try to provide a fix for this, but only a workaround.

Workaround 

The problem appears when the Crawler Configuration and the Indexed Search Configuration are stored on the same page. The workaround is to move the Indexed Search Configuration to a different page. I have not experienced any side effects from this change, but if you do, please report them to me.

This workaround is for these two bugs:

https://github.com/tomasnorre/crawler/issues/576 and https://github.com/tomasnorre/crawler/issues/739

If you would like to know more about what is going on, you can look at the core:

https://github.com/TYPO3/TYPO3.CMS/blob/10.4/typo3/sysext/indexed_search/Classes/Hook/CrawlerHook.php#L156

Here an int value is submitted instead of a string. This change goes back more than 8 years, so it is surprising that it never was a problem before.

Crawler Log shows "-" as result 

In Crawler v11.0.0, after introducing PHP 8.0 compatibility, we are affected by a bug in PHP itself (https://bugs.php.net/bug.php?id=81320). This bug turns the Crawler status into invalid JSON, so the correct result cannot be rendered, and the result is displayed in the Crawler Log as -.

Even though the page is crawled correctly, the status is incorrect, which is of course not desired.

Workaround 

One solution can be to remove the php8.0-uploadprogress package from your server; if its version is below 1.1.4 it will trigger the problem. Removing the package can of course be a problem if you depend on it.

If possible, update it to 1.1.4 or higher instead; then the problem should be solved as well.
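To check which version of the uploadprogress extension is loaded, a quick sketch (assuming the PHP CLI uses the same extension set as your web server):

$ php --ri uploadprogress | grep -i version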

Site config baseVariants not used 

An issue was reported that the site configuration baseVariants were not respected by the Crawler (https://github.com/tomasnorre/crawler/issues/851). It turned out that the crawler had problems with the ApplicationContext being set in .htaccess, as in this example:

<IfModule mod_rewrite.c>
   # Rules to set ApplicationContext based on hostname
   RewriteCond %{HTTP_HOST} ^(.*)\.my\-site\.localhost$
   RewriteRule .? - [E=TYPO3_CONTEXT:Development]
   RewriteCond %{HTTP_HOST} ^(.*)\.mysite\.info$
   RewriteRule .? - [E=TYPO3_CONTEXT:Production/Staging]
   RewriteCond %{HTTP_HOST} ^(.*)\.my\-site\.info$
   RewriteRule .? - [E=TYPO3_CONTEXT:Production]
</IfModule>

Workaround 

This problem is not solved yet, but it can be bypassed by using helhum/dotenv-connector: https://github.com/helhum/dotenv-connector

X-T3Crawler-Meta header missing 

When the crawler log reports "Response has no X-T3Crawler-Meta header", then a firewall probably filters incoming or outgoing HTTP headers.

The crawler sends an X-T3Crawler header to TYPO3 and expects an X-T3Crawler-Meta header in the response. If these are removed in transit, the crawler will not work.
