Indexed Search
The TYPO3 Crawler is quite often used to regenerate the index of Indexed Search.
Frontend indexing setup
Here we will configure indexed_search to automatically index pages when they are visited by users in the frontend.
- Make sure you do not have config.no_cache set in your TypoScript configuration - this prevents indexing.
- Admin Tools > Settings > Extension configuration > indexed_search: Make sure "Disable Indexing in Frontend" is disabled (thus frontend indexing is enabled).
- Web > List: In your site root, create a new "Indexing Configuration" record.
  - Type: Page tree
  - Depth: 4
  - Access > Enable: Activate it
  Save.
- Edit the page settings of a visible page and make sure that Behaviour > Miscellaneous > Include in search is activated.
- View this page in the frontend.
- Web > Indexing > Detailed statistics: The page you just visited is shown as "Indexed" now - with Filename, Filesize and indexing timestamp.
If this did not work, clear both frontend and all caches. Getting frontend indexing to work is crucial for the rest of this How-To.
Crawler setup
Now that frontend indexing works, it's time to configure crawler to re-index all pages - instead of relying on visitors to trigger indexing:
- Admin Tools > Settings > Extension configuration > indexed_search: Enable "Disable Indexing in Frontend", so that indexing only happens through the crawler.
- Web > List: In your site root, create a new "Crawler configuration" record.
  - Name: crawl-mysite
  - Processing instruction filter: Enable "Re-Indexing [tx_indexedsearch_reindex]"
  Save.
- Do a manual crawl on command line. "23" is the site root page UID:
    $ ./vendor/bin/typo3 crawler:buildQueue 23 crawl-mysite --depth 2 --mode exec -vvv
    Executing 2 requests right away:
    [19-08-25 14:13] http://example.org/ (URL already existed)
    [19-08-25 14:13] http://example.org/faq (URL already existed)
    <warning>Internal: (Because page is hidden)</warning>
    <warning>Tools: (Because doktype "254" is not allowed)</warning>
    Processing
    http://example.org/ (tx_indexedsearch_reindex) => OK:
    User Groups:
    http://example.org/faq (tx_indexedsearch_reindex) => OK:
    User Groups:
    2/2 [============================] 100% 1 sec/1 sec 42.0 MiB
- Web > Indexing: All pages should be indexed now.
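The manual crawl can also be run as two separate steps, queueing the URLs first and processing them afterwards. The following is a minimal sketch, assuming that crawler:buildQueue only queues URLs when "--mode exec" is omitted. The helper name reindex_site and the TYPO3_BIN variable are hypothetical and not part of crawler; the amount of 50 is only an example value.

```shell
#!/bin/sh
# Path to the TYPO3 CLI binary; set TYPO3_BIN if yours lives elsewhere.
TYPO3_BIN="${TYPO3_BIN:-./vendor/bin/typo3}"

# reindex_site is a hypothetical helper, not part of the crawler extension.
# $1 = site root page UID, $2 = crawler configuration name, $3 = depth
reindex_site() {
    # Step 1: queue the page URLs (assumed default when --mode exec is omitted)
    "$TYPO3_BIN" crawler:buildQueue "$1" "$2" --depth "$3" || return 1
    # Step 2: process up to 50 queued URLs
    "$TYPO3_BIN" crawler:processQueue --amount 50
}

# Usage: reindex_site 23 crawl-mysite 2
```

Splitting the work this way lets the processing step be repeated in small batches until the queue is empty.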
Nightly crawls
We want crawler to run automatically at night:
- Create the first scheduler task that will create a list with page URLs that the second task will crawl.
  System > Scheduler > +:
  - "Class" is "Execute console commands"
  - "Frequency" is every night at 2 o'clock: 0 2 * * *
  - "Schedulable Command" must be "crawler:buildQueue"
  Save and continue editing:
  - "Argument: page" must be the UID of the site root page (23)
  - "Argument: conf" is crawl-mysite
  - "Option: depth" must be enabled and set to 99
  Save.
- Run the task manually, either via the scheduler module in the backend or via command line:
    $ ./vendor/bin/typo3 scheduler:run --task=1 -f -vv
    Task #1 was executed
  (1 is the scheduler task ID)
- Verify that the pages have been queued by looking at Web > Info > Site Crawler > Crawler log > 2 levels. The pages have a timestamp in the "Scheduled" column.
- Create the second scheduler task that will index all the page URLs queued by the first task:
  System > Scheduler > +:
  - "Class" is "Execute console commands"
  - "Frequency" is every 10 minutes: */10 * * * *
  - "Schedulable Command" must be "crawler:processQueue"
  Save and continue editing:
  - "Option: amount" should be 50, or any value that the system is able to process within the 10 minutes.
  Save.
- Run the task manually, again via the scheduler module in the backend (only if it's a small site!) or via command line:
    $ ./vendor/bin/typo3 scheduler:run --task=2 -f -vv
    Task #2 was executed
  This crawl task will run much longer than the queue task.
- Verify that the pages have been indexed by looking at Web > Indexing. All queued pages should have an index date now.
  Web > Info > Site Crawler > Crawler log > 2 levels should show a timestamp in the "Run-time" column, as well as OK in the "Status" column.
This completes the basic crawler setup. Every night at 2:00, all pages will be re-indexed in batches of 50.
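As a rough sanity check on the batch size, the time needed to work through the queue can be estimated from the number of queued pages. The following is a sketch only: the function name crawl_minutes and the figure of 1200 pages are illustrative assumptions, and the result is an approximation that ignores how long each batch actually takes.

```shell
#!/bin/sh
# Estimate the minutes needed to drain the crawler queue.
# $1 = queued pages, $2 = pages per run ("Option: amount"),
# $3 = minutes between runs of the processQueue task
crawl_minutes() {
    pages=$1
    amount=$2
    interval=$3
    # Ceiling division: a partly filled last batch still needs one run
    batches=$(( (pages + amount - 1) / amount ))
    echo $(( batches * interval ))
}

# Hypothetical site with 1200 pages, 50 pages per run, one run every 10 minutes
crawl_minutes 1200 50 10   # prints 240
```

If the estimate reaches into the next nightly queue build, consider raising "Option: amount" to a value the system can still process within the 10 minutes.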