Indexed Search

The TYPO3 Crawler is quite often used to regenerate the index of Indexed Search.

Frontend indexing setup

Here we will configure indexed_search to automatically index pages when they are visited by users in the frontend.

Make sure you do not have config.no_cache set in your TypoScript configuration - this prevents indexing.
Admin Tools > Settings > Extension configuration > indexed_search: Make sure "Disable Indexing in Frontend" is disabled (thus frontend indexing is enabled).
Web > List: In your site root, create a new "Indexing Configuration" record.
- Type: Page tree
- Depth: 4
- Access > Enable: Activate it
Save.
Edit the page settings of a visible page and make sure that Behaviour > Miscellaneous > Include in search is activated.
View this page in the frontend.
Web > Indexing > Detailed statistics: The page you just visited is shown as "Indexed" now - with Filename, Filesize and indexing timestamp.

If this did not work, clear both frontend and all caches. Getting frontend indexing to work is crucial for the rest of this How-To.

Crawler setup

Now that frontend indexing works, it's time to configure crawler to re-index all pages - instead of relying on visitors to trigger indexing:

Admin Tools > Settings > Extension configuration > indexed_search: Enable "Disable Indexing in Frontend", so that indexing only happens through the crawler.
Web > List: In your site root, create a new "Crawler configuration" record.
- Name: crawl-mysite
- Processing instruction filter: Enable "Re-Indexing [tx_indexedsearch_reindex]"
Save.

Do a manual crawl on command line. "23" is the site root page UID:

$ ./vendor/bin/typo3 crawler:buildQueue 23 crawl-mysite --depth 2 --mode exec -vvv

Executing 2 requests right away:
[19-08-25 14:13] http://example.org/ (URL already existed)
[19-08-25 14:13] http://example.org/faq (URL already existed)
<warning>Internal:  (Because page is hidden)</warning>
<warning>Tools:  (Because doktype "254" is not allowed)</warning>
Processing

http://example.org/ (tx_indexedsearch_reindex) =>

OK:
   User Groups:

http://example.org/faq (tx_indexedsearch_reindex) =>

OK:
   User Groups:

2/2 [============================] 100%  1 sec/1 sec  42.0 MiB

Web > Indexing: All pages should be indexed now.

Nightly crawls

We want crawler to run automatically at night:

Create the first scheduler task that will create a list with page URLs that the second task will crawl.

System > Scheduler > +:
- "Class" is "Execute console commands"
- "Frequency" is every night at 2 o'clock: 0 2 * * *
- "Schedulable Command" must be "crawler:buildQueue"
Save and continue editing:
- "Argument: page" must be the UID of the site root page (23)
- "Argument: conf" is crawl-mysite
- "Option: depth" must be enabled and set to 99
Save.
Run the task manually, either via the scheduler module in the backend or via command line:
```
$ ./vendor/bin/typo3 scheduler:run --task=1 -f -vv
Task #1 was executed
```
Copied!
(1 is the scheduler task ID)
Verify that the pages have been queued by looking at Web > Info > Site Crawler > Crawler log > 2 levels. The pages have a timestamp in the "Scheduled" column.
Create the second scheduler task that will index all the page URLs queued by the first task:

System > Scheduler > +:
- "Class" is "Execute console commands"
- "Frequency" is every 10 minutes: */10 * * * *
- "Schedulable Command" must be "crawler:processQueue"
Save and continue editing:
- "Option: amount" should be 50, or any value that the system is able to process within the 10 minutes.
Save.
Run the task manually, again via the scheduler module in the backend (only if it's a small page!) or via command line:
```
$ ./vendor/bin/typo3 scheduler:run --task=2 -f -vv
Task #2 was executed
```
Copied!
This crawl task will run much longer that the queue task.
Verify that the pages have been indexed by looking at Web > Indexing. All queued pages should have an index date now.

Web > Info > Site Crawler > Crawler log > 2 levels should show a timestamp in the "Run-time" column, as well as OK in the "Status" column.

This completes the basic crawler setup. Every night at 2:00, all pages will be re-indexed in batches of 50.