Introduction 

What does it do? 

The Indexed Search Engine provides two major elements to TYPO3:

  1. Indexing: An indexing engine which indexes TYPO3 pages on-the-fly as they are rendered by TYPO3's frontend. Indexing a page means that all words from the page (or specifically defined areas on the page) are registered, counted, weighted and finally inserted into a database table of words. Then another table will be filled with relation records between the word table and the page. This is the basic idea.
  2. Searching: A plugin you can insert on your website which allows website users to search for information on your website. When a user searches, the plugin first checks the word table for the search word; if the word exists, all pages that have a relation to that word are considered for the search result display. The search results are ordered based on factors such as where on the page the word was found or how frequently the word occurs on the page.

This is an example of how the search interface on a website looks:

[Figure: Search results in the frontend]

Features of the indexer 

The indexing engine has several features:

  • HTML data priority: 1) <title> data, 2) <meta-keywords>, 3) <meta-description>, 4) <body>
  • Indexing external files: text formats such as html and txt are indexed directly; doc and pdf files are indexed with the help of external programs (catdoc / pdftotext)
  • Word counting and frequency are used to rate results
  • Exact, partial or metaphone search
  • Searching freely for sentences (non-indexed)
  • Searching is not case-sensitive in any of these modes

Features of the search frontend (the plugin) 

The search interface has several options for advanced searching. Any of those can be disabled and/or preset with default values:

  • Searching whole word, part of word, sounds like, sentence
  • Logical AND and OR search including syntactical recognition of AND, OR and NOT as logical keywords. Furthermore sentences encapsulated in quotes will be recognized.
  • Searching can be targeted at specific media, for instance searching only indexed PDF files, HTML-files, Word-files, TYPO3-pages or everything
  • The engine is language-sensitive based on the multiple-language feature of the TYPO3 CMS frontend.
  • Searching can be performed in specific sections of the website.
  • Results can be sorted descending or ascending and ordered by word frequency, weight, location relative to page top, page modification date, page title, etc.
  • The display of search results can be intelligently divided into sections based on the internal page hierarchy. Thus results are primarily grouped by relation, then by hit-relevance.

This shows the full range of default options for "advanced search":

[Figure: All possible advanced search options]

User manual 

Adding the search plugin to a page 

  1. Activate "indexed_search" in the Extension Manager. In non-Composer mode the extension is already shipped and only needs to be activated. In Composer mode, make sure to require the package typo3/cms-indexed-search.
  2. Create a page called "Search" or something like that. This is where the search box will appear.
  3. Create an extension template for this page that includes "Indexed Search (Extbase & Fluid)" or include it in your main template.
  4. Create a new content element on that page and choose the type "General Plugin".
  5. Then choose the "Selected plugin" to be "Indexed search":

[Figure: Choosing "Indexed search" as a plugin type]

That's it. Your frontend should now look like this:

[Figure: Default view in the frontend, search form and rules help text]

Your styles will most likely differ from this; styling is controlled by a developer with administrative access to the system.

Installation 

This extension is part of the TYPO3 Core, but not installed by default.

Installation with Composer 

Check whether you are already using the extension with:

composer show | grep indexed

This should either give you no result or something similar to:

typo3/cms-indexed-search       v12.4.11

If it is not installed yet, use the composer require command to install the extension:

composer require typo3/cms-indexed-search

The version that gets installed depends on the version of the TYPO3 Core you are using.

Installation without Composer 

In an installation without Composer, the extension is already shipped but might not be activated yet. Activate it as follows:

  1. In the backend, navigate to the Admin Tools > Extensions module.
  2. Click the Activate icon for the Indexed Search extension.

[Figure: Extension manager showing Indexed Search extension]

Administration 

Monitoring indexed content 

The Indexed Search extension adds two backend modules: a global, database-wide statistics module and a page-specific analysis module.

In the Web > Indexing module (sub module Detailed statistics) you can see an overview of how many instances are indexed per TYPO3-page. Look at this image:

[Figure: Indexing statistics per page]

As you can see, most pages here are indexed only once, but a few are indexed twice. This can happen for several reasons; here it is most likely due to a user login or something similar.

The most interesting occurrence is the page "References" which has more than 20 indexed instances available. The reason is that this page holds multiple cached views due to some parameters which are used by a plugin on that page. Each instance will be searchable as a unique search result.

Now imagine that you want to clear out all those instances of the "References" page so that they are re-indexed when viewed again. Simply click the page "References" in the page tree to the left. Then you see this:

[Figure: Removing some indexing information to allow for reindexing]

You can either click the red garbage bin (1) in order to clear all listed instances or alternatively pick out single instances by clicking the local garbage bin (2).

Monitoring the global picture of indexed pages 

In the Tools > Indexing module you can get statistics about the indexing engine. Currently they are sparse and only roughly presented; this view needs more work to be friendly and really useful.

[Figure: Selecting the global Indexing module in the Admin Tools]

"General statistics" 

This shows that 217 pages are indexed, comprising more than 7,000 words and using 40,000 records in the relation table to link words to pages.

[Figure: Global indexing statistics]

"List: TYPO3 Pages" 

This view shows a list of indexed pages with all the technical details:

[Figure: Technical details for each page]

Setting up the "crawler" extension 

Before you can work with "Indexing configurations" you must make sure that you have set up the "crawler" extension and that a cron job is running which processes the crawler queue as it is filled. For this, please refer to the documentation of the "crawler" extension.

Generally about indexing configurations 

An Indexing configuration sets up indexing jobs that are performed by a cron script independently of frontend requests. The "crawler" extension is used as a service to execute the queue entries that control the indexing.

An Indexing configuration contains two parts:

  1. Definition of execution time and period.
  2. Definition of indexing type and settings.

Below you see what all Indexing Configurations have in common:

[Figure: Common parameters in indexing configurations]

These settings are described in the context sensitive help so please refer to that for more information.

The "Session ID" requires a short introduction: When an indexing job is started, this value is set to a unique number which is used as the ID for that process, and all indexed entries are tagged with it. When the processing of an indexing configuration is done, it is reset to zero again.

The title of a configuration can be translated in order to ease usage for backend users who use a different language than your default one. Translation strings can be provided via TypoScript:

plugin.tx_indexedsearch.settings._LOCAL_LANG {
    de.indexingConfigurations.13 = Mein Titel in Deutsch für Konfiguration 13
    de.indexingConfigurationHeader.13 = Alle Ergebnisse für Konfiguration 13
}

Periodic indexing of the website ("Page tree") 

You can have the whole page tree indexed overnight using this indexing configuration of type "Page tree":

[Figure: Settings in an indexing configuration for the full page tree]

This defines that the page tree is to be crawled to a depth of 3 levels from the root point "Testsite". For each page, a combination of parameters is calculated based on the "crawler" configurations for the "Re-index" processing instruction (see the "crawler" extension for more information). Those URLs are committed to the crawler log, together with entries for all subpages of the processed page, so that each of those pages is indexed as well.

This is what the crawler log may look like after processing:

[Figure: The crawler log after indexing the page tree]

Here you can see that the visited URLs have additional parameters added. These are combined based on the "crawler" extension's configuration in Page TSconfig.

Also notice the special crawler log entries found in the "Storage folder". These are the "meta-entries" which call an Indexed Search hook, which in turn generates the URL entries and pushes them to the queue.

On the far right in this view you can see that noted as well, including the "set_id":

[Figure: Viewing the id of the indexing configuration in the crawler log]

Finally, in Web > Info, "Indexed search", you will see that these visited URLs were re-indexed:

[Figure: Verifying the reindexing by the crawler]

Location: Indexing configurations for indexing of the page tree should be placed in a SysFolder since their location in the page tree is not relevant to their function.

Periodic indexing of records ("Database Records") 

You can also use the Indexing Configuration to index single records.

Location: You must place the indexing configuration on the page where you want the search results to be displayed - typically on the page where a plugin exists that can process the parameters pointing to the record. In the case below the Indexing Configuration is placed on the same page as the frontend plugin ("Morbi diam enim...") that can display the search results:

[Figure: Indexing configuration for records placed in the same page as the plugin]

The configuration record looks like this:

[Figure: Indexing configuration for arbitrary records]

If the records you want to index are not located on the page where the indexing configuration and frontend plugin are, you can point to their location. Notice how the "GET parameters" field is used to define how the search results are shown: it must match the parameters that the plugin expects.

A fancy option is the "Index Records immediately when saved" - which will index records as they are saved through "DataHandler"!

In the crawler log you will see the entries for record indexing like this:

[Figure: Indexing configuration for arbitrary records]

After processing, the Web > Info, "Indexed search" view will show this:

[Figure: Verifying the indexed records]

Notice how the GET parameters are nicely added and how the "CfgUid" column contains the UID of the indexing configuration / the "set_id" of the processing.

In fact, if a record is removed its indexing entry will also be removed upon next indexing - simply because the "set_id" is used to finally clear out old entries after a re-index!

Indexing External websites ("External URL") 

You can index external websites using Indexing Configurations. They can actually crawl an external URL! Configuration looks like this:

[Figure: Indexing configuration for an external URL]

The configuration is largely self-explanatory. The Context Sensitive Help provides enough information to complete it.

Location: You should place the Indexing Configuration on a "Not-in-menu" page, for instance in the root of the site. The page must be "searchable" since the external URL results are bound to a page in the page tree, namely the page where the configuration is found.

This is how the crawler log looks immediately after the crawling has begun:

[Figure: Crawler log entries for an external URL]

The initial entry is "http://typo3.org/" which is already processed. When this process was executed it added entries for all found subpages to the queue as well. When their execution time comes the crawler will request those URLs as well and if subpages are found on them, entries for those subpages are added until the configured depth is reached.

In Web > Info, "Indexed search" the indexed entries look like this:

[Figure: Verifying the list of indexed external URLs]

Indexing directories of files ("Filepath on server") 

You can also have directories of files on your server indexed periodically, using the type "Filepath on server".

[Figure: Indexing configuration for a directory]

Again, the options are either easy to understand or you can read more about them in the Context Sensitive Help.

Location: The indexing configuration should be located on a "Not-in-menu" page, just as for the "External URL" type, and for the same reason: results are bound to a page in the page tree.

The process of indexing a directory of files is the same as for the external URL: For each directory a) all files are indexed and b) all sub-directories added to the crawler queue for later processing. This is shown in the crawler log:

[Figure: Crawler log entries for directories]

When processing is done the result is shown in the Web > Info, "Indexed search":

[Figure: Verifying the list of indexed directories]

Showing the search results 

By default the search results are shown with no distinction between those from local TYPO3 pages, indexed records, file paths and external URLs. The only division follows the page on which the result is found:

[Figure: Basic view of search results]

However, you can configure the search results to be divided into categories that follow the indexing configurations:

[Figure: Categorized view of search results]

To obtain this categorization you must set TypoScript configuration in the Setup field like this:

plugin.tx_indexedsearch.settings.defaultFreeIndexUidList = 0,6,7,8
plugin.tx_indexedsearch.settings.blind.freeIndexUid = 0

The "defaultFreeIndexUidList" is a list of UIDs of the indexing configurations to show in the categorization. The order determines which are shown at the top. Changing it could bring the results from typo3.org to the top:

[Figure: Categorized view of search results with a set order for categories]

The categorization happens when the "Category" selector in the "Advanced" search form is set like this:

[Figure: Choosing categorization in the advanced search form]

(Notice, you can preset this value from TypoScript as well!)

Searching a specific category from URL 

If you want search forms on the site to look up results belonging to one or more indexing configurations directly, you can use a set of GET variables like these, here using UID values 7 and 8 since they look up typo3.org results:

index.php?id=78&tx_indexedsearch[sword]=level&tx_indexedsearch[_freeIndexUid]=7,8

Grouping several indexing configurations in one search category 

You might find that you want to group the results from multiple indexing configurations in the same category. For instance, I have an indexing configuration for "typo3.org" but I want all search results to appear under the category "External URLs". This can be done by creating a special type of indexing configuration which only points to other indexing configurations:

[Figure: Grouping several indexing configurations]

This indexing configuration is not used during indexing but during searching. So a reconfiguration of the TypoScript to use uid 9 instead of 7,8 will yield this result:

[Figure: Grouped search results]

TypoScript:

plugin.tx_indexedsearch.settings.defaultFreeIndexUidList = 9,6,0

Disable frontend-initiated indexing 

If you choose to index your site using Indexing Configurations you can disable indexing through the user requests in the frontend. This is easily done via the configuration of the Indexed Search extension in the Extension Manager:

[Figure: Disabling the frontend indexing in the extension configuration]

Indexing files on pages separately 

If enabled, links to local files found on pages will initiate indexing of those external files. However, this often has the unpleasant effect that too many files are indexed during the same page request. Using the "crawler" extension you can configure the indexer to add a queue entry instead of immediately indexing external files. The indexing then happens outside the frontend user request, using the cron script.

This behaviour is configured in the Extension Manager's configuration for "Indexed search":

[Figure: Setting the crawler to be used for linked files]

General 

The most basic requirement for the search engine to work is that pages are indexed. That does not happen just by installing the plugin: you have to enable indexing for a page object in TypoScript. There are good reasons for this. First, not all sites in a TYPO3 database might need indexing, so indexing is disabled by default and enabled on a per-site basis. Second, a single site may use frames, in which case only the page object that actually shows the page content needs to be indexed.

Let's say that you have a PAGE object called "page" (which is pretty typical); then you have to set this config option:

page.config.index_enable = 1

When this option is set you should begin to see your pages being indexed when they are shown next time. Remember that only cached pages are indexed!

This is documented in the CONFIG section of the TSref. Please look there for further options; for instance, indexing of external media can also be enabled there.
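
As a minimal sketch, the two indexing switches might be combined as follows. The property name index_externals is taken from the CONFIG section of the TSref and is not described in this manual, so treat it as an assumption and verify it against the TSref for your TYPO3 version:

```typoscript
# Enable on-the-fly indexing for the PAGE object called "page"
page.config.index_enable = 1

# Assumption (from the TSref CONFIG section): also index external media
# that are linked on the indexed pages
page.config.index_externals = 1
```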

Languages 

The plugin supports all system languages in TYPO3. Translation is done using the typo3.org tools.

If you want to use e.g. the Danish language, it will automatically be used if this option is set in your template (the value is the internal language key):

config.language = da

TypoScript 

Plugin settings 

Each of the following options is defined for the TypoScript setup path plugin.tx_indexedsearch.settings.

Target pid 

targetPid

Type: int
Default: empty
Path: plugin.tx_indexedsearch.settings

Set the target page ID for the Extbase variant of the plugin. An empty value (default) falls back to the current page ID.
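
A minimal sketch, assuming the search plugin lives on a dedicated page whose ID is 78 (a hypothetical value, borrowed from the URL example elsewhere in this manual):

```typoscript
plugin.tx_indexedsearch.settings {
    # Hypothetical page ID of the dedicated "Search" page
    targetPid = 78
}
```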

Display Rules 

displayRules

Type: boolean
Default: 1
Path: plugin.tx_indexedsearch.settings

Display the search rules.

Display result number 

displayResultNumber

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Display the numbers of search results.

Display level 1 sections 

displayLevel1Sections

Type: boolean
Default: 1
Path: plugin.tx_indexedsearch.settings

This enables the first-level menu in the "sections" selector, so that users can restrict a search to a section of the website.

Display level 2 sections 

displayLevel2Sections

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

This enables the second-level menu in the "sections" selector, so that users can restrict a search to a subsection. This setting only has an effect if displayLevel1Sections is true.
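
A minimal sketch of enabling both levels of the "sections" selector, using only the two properties described above:

```typoscript
plugin.tx_indexedsearch.settings {
    # First-level sections (already the default)
    displayLevel1Sections = 1
    # Second-level sections; only effective when level 1 is enabled
    displayLevel2Sections = 1
}
```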

Display level X all types 

displayLevelxAllTypes

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Loaded are, by default:

If displayLevelxAllTypes is set to true, then the page records for all evaluated IDs are loaded directly.

Display forbidden records 

displayForbiddenRecords

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Explicitly display search hits even though the visitor has no access to them.

Media list 

mediaList

Type: string
Default: empty
Path: plugin.tx_indexedsearch.settings

Restrict the file type list when searching for files.

Root pid list 

rootPidList

Type: string (list of integers, separated by comma)
Default: empty
Path: plugin.tx_indexedsearch.settings

A list of integers which should be root pages to search from. Thus you can search multiple branches of the page tree by setting this property to a list of page ID numbers.

If this value is set to less than zero (e.g. -1), the search will be performed in ALL parts of the page tree without regard to branches at all. An empty value (default) falls back to the current root page ID.
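
A minimal sketch of searching two branches of the page tree. The page IDs 12 and 34 are hypothetical and only serve as an illustration:

```typoscript
plugin.tx_indexedsearch.settings {
    # Search below the pages with the (hypothetical) IDs 12 and 34
    rootPidList = 12,34

    # Alternative: search the whole page tree regardless of branches
    # rootPidList = -1
}
```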

Detect domain records 

detectDomainRecords

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

If set, the search results are linked to the proper domains where they are found.

Target 

detectDomainRecords.target

Type: string
Default: empty
Path: plugin.tx_indexedsearch.settings

Target for external URLs.

Default free index UID list 

defaultFreeIndexUidList

Type: string (list of integers, separated by comma)
Default: empty
Path: plugin.tx_indexedsearch.settings

List of Indexing Configuration UIDs to show as categories in the search form. The order determines the order displayed in the search result.

Exact count 

exactCount

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Force a permission check for every record while displaying search results. Otherwise, records are only checked up to the current result page, and the result counter might not show the exact number of search hits.

When this setting is enabled, the loop is not stopped early, which yields an exact result count at the cost of a slowdown caused by this overhead.

See property displayForbiddenRecords for more information.

Results 

results

Type: Array
Default: empty
Path: plugin.tx_indexedsearch.settings

Various crop/offset settings for single result items.

Length of the cropped results title 

results.titleCropAfter

Type: int
Default: 50
Path: plugin.tx_indexedsearch.settings

Determines the length of the cropped title.

Crop signifier for results title 

results.titleCropSignifier

Type: string
Default: ...
Path: plugin.tx_indexedsearch.settings

Determines the string being appended to a cropped title.

Length of the cropped summary 

results.summaryCropAfter

Type: int
Default: 180
Path: plugin.tx_indexedsearch.settings

Determines the length of the cropped summary.

Crop signifier for the summary 

results.summaryCropSignifier

Type: string
Default: ...
Path: plugin.tx_indexedsearch.settings

Determines the string being appended to a cropped summary.

Length of a summary to highlight search words 

results.markupSW_summaryMax

Type: int
Default: 300
Path: plugin.tx_indexedsearch.settings

Maximum length of a summary to highlight search words in.

Character count next to highlighted search word 

results.markupSW_postPreLgd

Type: int
Default: 60
Path: plugin.tx_indexedsearch.settings

Determines the number of characters to keep on both sides of the highlighted search word.

Characters offset from the right side of a highlighted search word 

results.markupSW_postPreLgd_offset

Type: int
Default: 5
Path: plugin.tx_indexedsearch.settings

Determines the offset of characters from the right side of a highlighted search word. Higher values will "move" the highlighted search word further to the left.

Divider for highlighted search words 

results.markupSW_divider

Type: string
Default: ...
Path: plugin.tx_indexedsearch.settings

Divider for highlighted search words in the summary.
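
The crop and highlighting settings above could be combined in one block as sketched here. The values 60 and 200 are arbitrary illustrations that differ from the documented defaults; all other values repeat the defaults:

```typoscript
plugin.tx_indexedsearch.settings {
    results {
        # Crop titles after 60 characters instead of the default 50
        titleCropAfter = 60
        titleCropSignifier = ...

        # Crop summaries after 200 characters instead of the default 180
        summaryCropAfter = 200
        summaryCropSignifier = ...

        # Highlighting of search words in the summary (default values)
        markupSW_summaryMax = 300
        markupSW_postPreLgd = 60
        markupSW_postPreLgd_offset = 5
        markupSW_divider = ...
    }
}
```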

Excludes doktypes in path 

results.pathExcludeDoktypes

Type: string
Default: empty
Path: plugin.tx_indexedsearch.settings

Excludes doktypes in rootline.

Example:

plugin.tx_indexedsearch.settings {
    results {
        pathExcludeDoktypes = 254
    }
}

Exclude folder (doktype: 254) in path for the result.

/Footer(254)/Navi(254)/Imprint(1) -> /Imprint.

plugin.tx_indexedsearch.settings {
    results {
        pathExcludeDoktypes = 254,4
    }
}

Exclude folder (doktype: 254) and shortcuts (doktype: 4) in path for result.

/About-Us(254)/Company(4)/Germany(1) -> /Germany.

Default options 

defaultOptions

Type: Array
Default: empty
Path: plugin.tx_indexedsearch.settings

Setting of default values.

Please see the options below.

Default: Operand 

defaultOptions.defaultOperand

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Possible values:

  • 0: All words (AND)
  • 1: Any words (OR)

Default: Sections 

defaultOptions.sections

Type: string (list of integers, separated by comma)
Default: 0
Path: plugin.tx_indexedsearch.settings

Default: Free index UID 

defaultOptions.freeIndexUid

Type: int
Default: -1
Path: plugin.tx_indexedsearch.settings

Default: Media type 

defaultOptions.mediaType

Type: int
Default: -1
Path: plugin.tx_indexedsearch.settings

Default: Sort order 

defaultOptions.sortOrder

Type: string
Default: rank_flag
Path: plugin.tx_indexedsearch.settings

Default: Language UID 

defaultOptions.languageUid

Type: string
Default: current
Path: plugin.tx_indexedsearch.settings

Default: Sort desc 

defaultOptions.sortDesc

Type: boolean
Default: 1
Path: plugin.tx_indexedsearch.settings

Default: Search type 

defaultOptions.searchType

Type: int
Default: 1
Path: plugin.tx_indexedsearch.settings

Possible values are 0, 1 (any part of the word), 2, 3, 10 and 20 (sentence).

Default: Extended resume 

defaultOptions.extResume

Type: boolean
Default: 1
Path: plugin.tx_indexedsearch.settings
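
The defaultOptions properties above could be preset together as in the following sketch. It only uses values named in this section (1 for "any words", 20 for sentence search); which preset is sensible depends on your site:

```typoscript
plugin.tx_indexedsearch.settings {
    defaultOptions {
        # 1 = any words (OR) instead of the default 0 = all words (AND)
        defaultOperand = 1
        # 20 = sentence search instead of the default 1 = any part of the word
        searchType = 20
        # Keep the default sorting: by rank_flag, descending
        sortOrder = rank_flag
        sortDesc = 1
    }
}
```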

Blind 

blind

Type: Array
Default: empty
Path: plugin.tx_indexedsearch.settings

Hides ("blinds") option selectors, or individual values within them, in the advanced search form.

Please see the options below.

Blind: Search type 

blind.searchType

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Blind: Default operand 

blind.defaultOperand

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Blind: Sections 

blind.sections

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Blind: Free index UID 

blind.freeIndexUid

Type: boolean
Default: 1
Path: plugin.tx_indexedsearch.settings

Blind: Media type 

blind.mediaType

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Blind: Sort order 

blind.sortOrder

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Blind: Group 

blind.group

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Blind: Language UID 

blind.languageUid

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Blind: Desc 

blind.desc

Type: boolean
Default: 0
Path: plugin.tx_indexedsearch.settings

Blind: Number of results 

blind.numberOfResults

Type: string (list of integers, separated by comma)
Default: 10,25,50,100
Path: plugin.tx_indexedsearch.settings

List of the numbers of results per page that can be selected. If a different number is sent via GET or POST, the default value is used instead, to prevent DoS attacks.

Blind: Extended resume 

blind.extResume

Type: boolean
Default: 1
Path: plugin.tx_indexedsearch.settings
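
A sketch of how the blind properties above might be used to simplify the advanced search form. It assumes that a value of 1 hides the corresponding selector and 0 shows it, which matches the categorization example in this manual where blind.freeIndexUid is set to 0 to show the category selector:

```typoscript
plugin.tx_indexedsearch.settings {
    blind {
        # Assumption: 1 hides the selector, 0 shows it
        mediaType = 1
        sortOrder = 1

        # Show the category selector (hidden by default)
        freeIndexUid = 0

        # Offer only these numbers of results per page
        numberOfResults = 10,25
    }
}
```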

Flag rendering 

flagRendering.[languageUid]

Type: Array
Default: empty
Path: plugin.tx_indexedsearch.settings

flagRendering is rendered as a TypoScript object and is used to output a flag icon according to the language of a result item. The ID of the language (sys_language_uid) is passed as the value for "current". This makes it possible to use a CASE TypoScript object to create a separate rendering for each language.

Examples:

plugin.tx_indexedsearch.settings {
    flagRendering = CASE
    flagRendering {
        key.current = 1

        2 = TEXT
        2.value = German

        default = TEXT
        default.value = English
    }
}

Icon rendering 

iconRendering.[imageType]

Type: Array
Default: empty
Path: plugin.tx_indexedsearch.settings

iconRendering is rendered as a TypoScript object and is used to output an icon according to the file extension of the result item, which is passed as the value for "current". This makes it possible to use a CASE TypoScript object to create a separate rendering for each file type.

Examples:

plugin.tx_indexedsearch.settings {
    iconRendering = CASE
    iconRendering {
        key.current = 1

        default = IMAGE
        default.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/pages.gif

        csv = IMAGE
        csv.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/csv.gif

        doc = IMAGE
        doc.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/doc.gif

        docx = IMAGE
        docx.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/docx.gif

        dotx = IMAGE
        dotx.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/dotx.gif

        html = IMAGE
        html.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/html.gif
        htm < .html

        jpg = IMAGE
        jpg.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/jpg.gif
        jpeg < .jpg

        pdf = IMAGE
        pdf.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/pdf.gif

        potx = IMAGE
        potx.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/potx.gif

        pps = IMAGE
        pps.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/pps.gif

        ppsx = IMAGE
        ppsx.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/ppsx.gif

        ppt = IMAGE
        ppt.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/ppt.gif

        pptx = IMAGE
        pptx.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/pptx.gif

        rtf = IMAGE
        rtf.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/rtf.gif

        sxc = IMAGE
        sxc.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/sxc.gif

        sxi = IMAGE
        sxi.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/sxi.gif

        sxw = IMAGE
        sxw.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/sxw.gif

        tif = IMAGE
        tif.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/tif.gif

        txt = IMAGE
        txt.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/txt.gif

        xls = IMAGE
        xls.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/xls.gif

        xlsx = IMAGE
        xlsx.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/xlsx.gif

        xltx = IMAGE
        xltx.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/xltx.gif

        xml = IMAGE
        xml.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/xml.gif

        # The following file types are recognized by the file content parser,
        # but currently there is no image file.

        ods = TEXT
        ods.value = ods

        odp = TEXT
        odp.value = odp

        odt = TEXT
        odt.value = odt
    }
}

Special configuration 

specialConfiguration.[pid]

Type

Array

Default

empty

Path

plugin.tx_indexedsearch.settings

specialConfiguration is an array of objects whose properties customize how a result row is displayed, depending on the row's position in the rootline. For instance, you can define that all results which link to pages in the branch below page ID 123 get a different page icon. Or you can add a suffix to the class names so that you can style that section differently.

Examples:

If a page "Contact" is found in a search for "address" and that "Contact" page is in the rootline

Frontpage [ID=23] > About us [ID=45] > Contact [ID=77]

then set the pid value to either "77" or "45". With "45", the "About us" page and all of its subpages get the same configuration.

If the pid value is set to 0 (zero), it will apply to all pages.

Please see the options below.
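The rootline matching described above can be pictured with a small Python sketch. It is not the extension's code: the function name and the dictionary layout are hypothetical, and the rule that the closest page in the rootline wins over a more distant one is an assumption.

```python
from typing import Optional


def find_special_configuration(
    rootline_ids: list[int],
    special_configuration: dict[int, dict],
) -> Optional[dict]:
    """Pick the configuration that applies to a result row.

    rootline_ids is ordered from the site root down to the page itself.
    Assumption: the closest configured page in the rootline wins.
    """
    for page_id in reversed(rootline_ids):
        if page_id in special_configuration:
            return special_configuration[page_id]
    # pid 0 applies to all pages
    return special_configuration.get(0)


config = {0: {"CSSsuffix": "default"}, 45: {"CSSsuffix": "about"}}
# "Contact" [77] lies below "About us" [45], so the setting for 45 applies.
print(find_special_configuration([23, 45, 77], config))
```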

Special configuration page icon 

specialConfiguration.[pid].pageIcon

Type

IMAGE cObject

Default

empty

Path

plugin.tx_indexedsearch.settings

Alternative page icon.

Example:

plugin.tx_indexedsearch.settings {
    specialConfiguration {
        0.pageIcon = IMAGE
        0.pageIcon.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/pages.gif

        1.pageIcon = IMAGE
        1.pageIcon.file = EXT:indexed_search/Resources/Public/Icons/FileTypes/pdf.gif
    }
}

Special configuration CSS suffix 

specialConfiguration.[pid].CSSsuffix

Type

string

Default

empty

Path

plugin.tx_indexedsearch.settings

A string that is appended to every class name used in the presentation of the result row.

Example:

plugin.tx_indexedsearch.settings {
    specialConfiguration {
        1.CSSsuffix = doc
    }
}

For example, if the value of CSSsuffix is "doc", the class name tx-indexedsearch-title becomes tx-indexedsearch-title-doc.

[tsref:plugin.tx_indexedsearch]

Fluid Templating 

The plugin "Indexed Search" can be extended with custom templates:

plugin.tx_indexedsearch.view {
    templateRootPaths {
        0 = EXT:indexed_search/Resources/Private/Templates/
        10 = {$plugin.tx_indexedsearch.view.templateRootPath}
        20 = EXT:myextension/Resources/Private/Templates/
    }

    partialRootPaths {
        0 = EXT:indexed_search/Resources/Private/Partials/
        10 = {$plugin.tx_indexedsearch.view.partialRootPath}
        20 = EXT:myextension/Resources/Private/Partials/
    }
}

The above configuration will make the plugin look for any template in myextension at the given relative path first and fall back to the default indexed_search template if the configured template cannot be found.
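The fallback order can be illustrated with a short Python sketch. It only models the lookup described above (highest numeric key first, then down to key 0). The function name, the in-memory set of existing files and the template file name are assumptions made for the example, not part of TYPO3.

```python
from typing import Optional


def resolve_template(
    root_paths: dict[int, str],
    template_name: str,
    existing_files: set[str],
) -> Optional[str]:
    """Return the first existing template, trying the highest key first."""
    for key in sorted(root_paths, reverse=True):
        path = root_paths[key]
        if not path:
            # An unset constant leaves an empty path; skip it.
            continue
        candidate = path + template_name
        if candidate in existing_files:
            return candidate
    return None


paths = {
    0: "EXT:indexed_search/Resources/Private/Templates/",
    10: "",  # constant not set in this example
    20: "EXT:myextension/Resources/Private/Templates/",
}
files = {"EXT:indexed_search/Resources/Private/Templates/Search/Form.html"}
# myextension does not ship this template, so the default one is used.
print(resolve_template(paths, "Search/Form.html", files))
```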

HTML content 

HTML content is weighted by the indexing engine in this order:

  1. <title>-data
  2. <meta-keywords>
  3. <meta-description>
  4. <body>

In addition you can insert markers as HTML comments which define which part of the body-text to include or exclude in the indexing:

The markers are <!--TYPO3SEARCH_begin--> and <!--TYPO3SEARCH_end-->.

Rules:

  1. If there is no marker at all, everything is included.
  2. If the first marker found is an "end" marker, the content before that point is included and the content that follows, up to the next "begin" marker, is excluded.
  3. If the first marker found is a "begin" marker, the content before that point is excluded and the content that follows, up to the next "end" marker, is included.
  4. If there are multiple marker pairs in HTML, content from in between all pairs is included.
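The four rules can be expressed as a small state machine. The following Python sketch is an illustration under stated assumptions, not the indexer's code: the function name is hypothetical, and content after a "begin" marker that is never closed is assumed to be included up to the end of the document.

```python
import re

# The two marker comments, matched literally.
MARKER = re.compile(r"<!--TYPO3SEARCH_(begin|end)-->")


def indexable_body(body: str) -> str:
    """Return the parts of the body text selected by the markers."""
    markers = list(MARKER.finditer(body))
    if not markers:
        # Rule 1: no marker at all, everything is included.
        return body
    # Rules 2 and 3: the first marker decides whether the leading
    # content is included ("end" first) or excluded ("begin" first).
    including = markers[0].group(1) == "end"
    parts = []
    position = 0
    for marker in markers:
        if including:
            parts.append(body[position:marker.start()])
        including = marker.group(1) == "begin"
        position = marker.end()
    if including:
        # Assumption: an unclosed "begin" includes the rest of the text.
        parts.append(body[position:])
    return "".join(parts)


html = "menu<!--TYPO3SEARCH_begin-->content<!--TYPO3SEARCH_end-->footer"
print(indexable_body(html))
```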

Use of hashes 

The hashes used are md5 hashes of which the first 7 characters are converted into an integer; that integer is stored as the hash in the database. This saves space: the hash takes only 4 bytes instead of a varchar of 32 bytes. A hash of 7 hexadecimal characters (28 bits) is estimated to be sufficient. Originally 8 characters were used, but at some point PHP changed the behavior of the hexdec function: where a 32-bit input used to yield negative values for half of the inputs, all results were suddenly positive. Keeping 8 characters would have required a corresponding change of the database fields. To keep it simple, the length was reduced to 7 characters, so that all values are positive.
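As a rough illustration of the scheme, the following Python sketch derives an integer from the first 7 hexadecimal characters of an md5 hash. The function name and the example input string are made up for this example. How the indexer serializes its input before hashing is not shown here.

```python
import hashlib


def int_hash(value: str, length: int = 7) -> int:
    """Convert the first `length` hex characters of an md5 hash to an integer."""
    hex_digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(hex_digest[:length], 16)


# 7 hex characters cover 28 bits, so the largest possible value is
# 0xFFFFFFF (268435455), which fits into a signed 32-bit integer column.
print(int_hash("some page identification string"))
```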

How pages are indexed 

First of all a page must be cacheable. For pages where the cache is disabled, no indexing will occur.

The "phash" is a unique identification of a "page" with regard to the indexer. So an entry in the index_phash table equals 1 resultrow in the search-results (called a phash-row).

A phash is a combination of the page-id, type, sys_language id, gr_list, MP and the cHash parameters of the page (function setT3Hashes()). If the phash is made for EXTERNAL media (item_type > 0), it is a hash of the absolute filename combined with any "subpage" indication, for instance if a PDF document is split into subsections.

So for external media there is one phash-row for each file (except PDF files, where there may be more). For TYPO3 pages, however, several phash-rows can match one single page. The type parameter is normally always the same, namely the type number of the content page. But the cHash may matter for the result when plugins use it. For instance, a message board may make pages cacheable by using the cHash params. If so, each cached page is also indexed, resulting in many phash-rows for a single page-id.

But the trickiest reason for having multiple phash-rows for a single TYPO3 page id is the gr_list! It works like this: if a page has exactly the same content both with and without login, it is stored only once! If the page content differs depending on whether a user is logged in (it may even differ based on the fe_groups!), it is indexed as many times as the content differs. The phash is of course different, but the phash_grouping value is the same.

The table index_grlist always holds one record per phash-row (of item_type=0, that is TYPO3 pages). But it may also hold many more records. These point to the phash-row in question for other gr_list combinations which actually had the SAME content and thus refer to the same phash-row.
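The interplay of phash, phash_grouping, content hash and index_grlist can be sketched in Python as follows. This is a simplified model and not the indexer's implementation: the class, the way the identifying values are joined into a string, and the omission of MP and cHash parameters are all assumptions made to keep the example short.

```python
import hashlib


def int_hash(value: str) -> int:
    """Integer built from the first 7 hex characters of an md5 hash."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest()[:7], 16)


class PageIndex:
    """Toy model of index_phash and index_grlist for TYPO3 pages."""

    def __init__(self) -> None:
        self.phash_rows: dict[int, dict] = {}      # phash -> row
        self.grlist: set[tuple[int, str]] = set()  # (phash, gr_list)

    def index_page(self, page_id: int, page_type: int, language: int,
                   gr_list: str, content: str) -> int:
        grouping = int_hash(f"{page_id}|{page_type}|{language}")
        phash = int_hash(f"{page_id}|{page_type}|{language}|{gr_list}")
        content_hash = int_hash(content)
        # Same page and same content under another gr_list: do not index
        # again, only note that this gr_list sees the existing row.
        for known_phash, row in self.phash_rows.items():
            if (row["phash_grouping"] == grouping
                    and row["contentHash"] == content_hash):
                self.grlist.add((known_phash, gr_list))
                return known_phash
        self.phash_rows[phash] = {
            "phash_grouping": grouping,
            "contentHash": content_hash,
            "gr_list": gr_list,
        }
        self.grlist.add((phash, gr_list))
        return phash


index = PageIndex()
anonymous = index.index_page(2, 0, 0, "0,-1", "public text")
group_1_2 = index.index_page(2, 0, 0, "0,-2,1,2", "public text and secret")
group_1 = index.index_page(2, 0, 0, "0,-2,1", "public text and secret")
# Two phash-rows, three index_grlist records; group_1 reuses group_1_2's row.
print(len(index.phash_rows), len(index.grlist), group_1 == group_1_2)
```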

External media 

External media (pdf, doc, html, txt) is tricky. External media is always detected as links to local files in the content of a TYPO3 page which is being indexed. But external media can be linked to from more than one page. So the index_section table may hold many entries for a single external phash-record, one for each position where it is found. It is also important to notice that external media is only indexed or updated if a "parent" TYPO3 page is re-indexed. Only then will the links to the external files be found. In a search operation external media is listed only once (grouping by phash). If, say, two TYPO3 pages link to the document, only one of them is shown as the path where the link can be found. However, if neither of the TYPO3 pages is available, the document is not shown.
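The last rule can be illustrated with a minimal Python sketch. The function name and the shape of the section records are hypothetical. The sketch only models the idea that a document needs at least one accessible parent page among its index_section entries.

```python
from typing import Optional


def parent_page_for_document(
    document_phash: int,
    sections: list[dict],
    accessible_page_ids: set[int],
) -> Optional[int]:
    """Return one accessible page linking to the document, or None."""
    for section in sections:
        if (section["phash"] == document_phash
                and section["page_id"] in accessible_page_ids):
            return section["page_id"]
    # No accessible parent page: the document is left out of the result.
    return None


# One document (phash 268192666) linked from two pages.
sections = [
    {"phash": 268192666, "page_id": 10},
    {"phash": 268192666, "page_id": 11},
]
print(parent_page_for_document(268192666, sections, {11}))   # page 11
print(parent_page_for_document(268192666, sections, set()))  # None
```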

Access restricted pages 

A TYPO3 page is only ever available in the search result if the user has access to the page. This is secured in the final result query. Whether extendToSubpages is taken into account depends on the join_pages flag (see above). But the page is only listed if the user has access.

However, a page may be indexed more than once if its content differs from usergroup to usergroup, or between visitors with and without login. The result display still shows only one occurrence, because similar pages (determined based on phash_grouping) are detected.

The tricky scenario 

Say that a page has a content element with some secret information visible to only one usergroup. The page as a whole is visible to all users. The page is indexed twice, once without login and once with login, because the page content differs. The problem is that if a search matches one of the secret words in the access-restricted section, the page appears in the search result even if the user is not logged in!

The best solution to this problem is to allow the result to be listed anyway, but to HIDE the resume if the index_grlist table cannot positively confirm that the user's combination of usergroups has access to the result. So the result is there, but no resume is shown (the resume might contain hidden text).
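A minimal Python sketch of that check is given below. It assumes the index_grlist records are available as a set of (phash, gr_list) pairs. The function names and the result-row layout are invented for the illustration.

```python
def may_show_resume(phash: int, user_gr_list: str,
                    grlist_records: set[tuple[int, str]]) -> bool:
    """True only if index_grlist positively confirms this gr_list."""
    return (phash, user_gr_list) in grlist_records


def render_result(row: dict, user_gr_list: str,
                  grlist_records: set[tuple[int, str]]) -> dict:
    """List the hit in any case, but drop the resume when unconfirmed."""
    shown = {"title": row["item_title"], "resume": ""}
    if may_show_resume(row["phash"], user_gr_list, grlist_records):
        shown["resume"] = row["item_description"]
    return shown


records = {(111, "0,-2,1")}
row = {"phash": 111, "item_title": "Special content",
       "item_description": "secret words"}
print(render_result(row, "0,-1", records))    # resume hidden
print(render_result(row, "0,-2,1", records))  # resume shown
```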

External media 

The same applies to external media, which are linked from a TYPO3 page. When an external media item is selected, we can be sure that the page linking to it can be selected. But we cannot be sure that the link was in a section accessible to the user. So here, too, we should make a lookup in the index_grlist table, selecting the phash/gr_list by the phash_t3 value of the section record for the search result. If no such record is available, we should neither display a link to the document nor show the resume, but rather link to the page, from which the user can see the real link to the document.

Analysing the indexed data 

The indexer is constructed to work with TYPO3's page structure. Unlike a crawler, which simply indexes all the pages it can find, the TYPO3 indexer MUST take the following into account:

  • Only cached pages can be indexed. Pages with dynamic content, such as search pages, should supply their own search engine for lookup in specific tables. Another option is to selectively allow certain of those "dynamic" pages to be cached anyway (see the static_page_arguments concept used by some plugins).
  • Pages in more than one language must be indexed separately as "different pages".
  • Pages with message boards may have multiple indexed versions based on what is displayed on the page: the overview or a single message board item? This is determined by the static_page_arguments value.
  • Access restrictions on pages must be observed!
  • Because pages can contain different content depending on whether a user is logged in, and even on which groups the user is a member of, a single page (identified by the combination of id/type/language/static_page_arguments) may be available in more than one indexed version based on the usergroups. But while the same page may have different content based on the usergroups (and so must be indexed once for each), such pages may just as well present the SAME content regardless of usergroups! This is the trickiest part.

Understanding these complex scenarios... 

The best thing to do is to look at an example. Please refer to the picture below while reading this list:

  1. The overview in general shows one line per "phash-row" (a single row from the index_phash table). Such a row represents a single hit in a searching session. In other words, each line with grayish background in the overview may be a search-hit. The columns of these rows are:

    • Title: The search-result title.
    • [icon]: Click here to remove the indexed information for this entry (will be re-indexed on the next hit).
    • pHash: The "id" of the search row. The hash is calculated based on id/type/language/MP/static_page_arguments/gr_list of the page when indexed. For external media it is based on filepath/page interval (for PDFs only).
    • cHash: Calculated based on the actual content which was indexed.
    • rl-012: The rootline ids for levels 0, 1 and 2. Used when searching in certain sections. For instance, a search operation may select all pages with rl1=123, which results in a search within pages that exist ONLY in the branch of the website where the level-1 page has uid=123.
    • pid.t.l: The page-id / type-number / sys_language uid.
    • Size: How many bytes the indexed page consumed
    • grlist: This is the gr_list of the user which initiated the indexing operation.
    • static_page_arguments: Additional parameters which are identifying the page in addition to the id/type number which usually does that.
  2. The page "Content elements" has one indexed version. The page-id of the root page is "1" and the page on level 1 in the rootline has the uid "2". Notice how all subpages of "Content elements" have the exact same rl0 and rl1 values. While the page "Content elements" does NOT have a value for rl2, all the subpages do (because they ARE level 2 themselves!). Furthermore, the page has the page-id "2", a type value of "0" and is indexed with the default language "0". The size was 10.6 KB and the user who initiated the indexing operation was a member of the groups 0,-2,1 (which is effectively fe_group "1", because 0 and -2 are pseudo-groups).
  3. On the page "Special content" there must have been a link to a local PDF file and a Word file, since those two are indexed in relation to this page. The PDF file is located in the path "uploads/media/tsref_onepage.pdf" relative to the website. Notice that the PDF file is actually indexed three times, once per page. This is of course configurable. Each indexed section of the PDF file has the potential to show up as a search result row (because the phash is different for each indexed part). The point of this is that a large PDF file might contain so much information that it matches far too many search queries. Breaking a PDF file down into smaller parts makes it possible to indicate exactly WHERE in the PDF file the search word was found!
  4. Looking at the Word file (and the PDF file as well), we see that they are found on BOTH the page "Special content" and the page "ISEARCH example". But the phash value (for the Word file it is "268192666") is the SAME in both cases. This means that the Word and PDF files are indexed only once, when they are first discovered! Later, when another page is indexed and a link to the same document appears, the document is not indexed as another document. Instead, an entry is made in the index_section table indicating that this result row is also available (linked to) from another page/section.

     Say you are doing a search in the section from "Content elements" and outwards in the page tree. The Word document is matched in the search, but it appears only once in the search result. Now, if one of the two pages linking to the Word document were hidden or access restricted, the Word document would still be matched (because one of the pages is accessible to the user). But if BOTH pages with the link to the Word document are inaccessible to the user doing the search, the Word document is not included in the search result.
  5. Here we can see that the pages "Special content", "Advanced" and "Menu/Sitemap" are indexed twice each. The reason is that those three pages have had different content depending on whether or not a user was logged in!

     In the case of the page "Special content", the reason is that the page contained a content element which was visible only to users who were members of group number 1. Therefore the page was different in the two cases.

     The page "Advanced" has a user login form, and that form looks different depending on whether a user is logged in.

     Finally, the page "Menu/Sitemap" apparently changed. The reason was that this page includes a sitemap, and that sitemap displayed some extra pages when logged-in users hit the page, so the content was not the same as without login.

     Another interesting thing is that two different users must have visited those pages. We can see that because the page "Special content" was apparently indexed with the usergroup combination "1,2". Later another user, a member of group "1" only, hit the page. However, the page content was the SAME. And because those two users saw the very same page, it was not indexed a third time. Instead it was noted that a user with membership of only group "1" also saw this same page. That comparison was based on the cHash (contentHash), which is a hash value based on the actual content being indexed. So when the user with group "1" only came to the page, the indexer realized that the page as it looked had already been indexed, because another phash-row with that content hash was already available.
  6. These pages do not appear to contain any tricks. According to the grlists, users with membership of groups "1,2", users with membership of group "1" only, and surfers who did not log in at all ("0,-1" is the pseudo-group for no login) have all visited the page. And because only one indexed version exists, the page must have presented the same content to all users regardless of their login status.

     The reason why the page "Your own scripts" does not contain a grlist value "0,-2,1,2" as the others do is simply that no user with that combination of usergroups has ever visited the page!
  7. txt and html documents can also be indexed as external media. In the case of HTML documents, the document's <title> is detected and used.
Several complex scenarios

In the image below we are looking at another scenario. In this case the static_page_arguments are obviously used by the plugin "tt_board". The plugin has been constructed so intelligently that it links to the messages in the message board without disabling the normal page cache, sending the tt_board_uid parameter along with a so-called "cHash" instead. If these are combined correctly, the caching engine allows the page to be cached. Not only does this mean a quicker display of pages in the message board, it also means we can index the page!

Complex scenario with tt_board

As you can see, the main board page showing the list of messages/threads ("Sourcream and Oni...") is indexed without any value for the parameter tt_board_uid (static_page_arguments). It has also been indexed once for each display of a message. In a search result any of these five rows may appear as an independent result row. After all, each of them is to be regarded as a single page with unique content, despite sharing the same page-id!

Another interesting thing is that while the main page has inherited the page title for the search result ("Sourcream and ..."), each of the indexed pages with a message has got another title, namely the subject line of the message shown! Thus a search matching three of these five pages will not show three identical page titles but a unique page title relative to the actual content on the page. It is the tt_board plugin that sets the page title itself by an API call.

The only glitch here is that the tt_board plugin has falsely allowed the main page to be cached twice. See the first and last phash-row. The last row was requested with the parameter "&tt_board_uid= " and the tt_board plugin should not have allowed that! Looking at the content hash of the first and last row, we realize that it is the SAME hash (84186444) and therefore the SAME content! However, being two separate result rows, they will both be displayed in the search result as separate hits. The responsibility for this lies with the plugin. Such occurrences can be filtered out automatically during the search result display, but it is better to avoid them in the first place.

The last example below has three main issues to discuss:

  1. The page "Other languages" is apparently available in three languages. Which ones cannot be determined unless we know the values from the sys_languages table. In this case the default language (zero - 0) is English and the languages with id 1 and id 2 are the Danish and German versions of the page. When a search is conducted, each page may turn up as a result page, but with a little flag telling if the page was found in another language than the main language of the website (see the second illustration hereafter).
  2. If no phash-rows are found for a page, this can mean three things:

    1. The page is not cached. In this case both the tt_products and tt_news plugins apparently disable the caching of the page, thereby disabling any indexing of the pages. Searching in news and products must be done with a search function that looks directly in the news and products tables.
    2. For other pages the reason may be that the pages have never been visited and are therefore not indexed yet! Indexing of pages in TYPO3 happens during the rendering of the page; there is currently no "crawler" to assist with this job.
    3. Finally, the reason for a page not being indexed can be a combination of 1 and 2: the page has never been visited, and had it been visited, the cache would have been disabled.
  3. These numbers just tell us that:

    • the page "Lists" was indexed once by a user with membership of groups 1 and 2.
    • the page "Addresses" was also indexed by a user with membership of groups 1 and 2 but has since been visited by a user without login. Both visits yielded the same page and it was therefore not indexed twice. This raises a question about the page "Lists": is it access restricted for users without login, or has a user without login just never visited that page, since no "0,-1" grlist has been detected? Both could be the answer. Pages with access restriction (or a whole section in an intranet) would obviously not have been indexed by no-login users. However, in this case nothing indicates that the page should be hidden from non-login users, so we must conclude that the page has simply not yet been visited by a no-login user. Otherwise it would look like the page "Addresses", which also has the "0,-1" list detected.
    • the "Guestbook" page was indexed by a user without login only.
More complex scenarios

Finally, the image below shows how localized versions are displayed in the search results:

Localized search results

Localized versions showing up in the search results

index_phash 

This table contains references to TYPO3 pages or external documents. The fields are like this:

phash 

Field
phash
Description

7md5/int hash. It's an integer based on a 7-char md5-hash.

This is a unique representation of the 'page' indexed.

For TYPO3 pages this is a serialization of id,type,gr_list (see later), MP and additional query parameters (which enables 'subcaching' with extra parameters). This concept is also used for TYPO3 caching (although the caching hash includes the all-array and thus takes the template into account, which this hash does not! It's expected that template changes through conditions would not seriously alter the page content)

For external media this is a serialization of 1) unique filename id, 2) any subpage indication (parallel to query parameters). gr_list is NOT taken into consideration here!

phash_grouping 

Field
phash_grouping
Description

7md5/int hash.

This is a non-unique hash exactly like phash, but WITHOUT the gr_list and (in addition) for external media without subpage indication. Thus this field indicates a 'unique' page (or file), while this page may exist twice or more due to gr_list. Use this field to GROUP BY the search so you get only one hit per page when selecting with gr_list in mind.

Currently a search result neither groups nor limits by this field; rather, the result display may group the result into logical units.
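Grouping in the result display could look roughly like the following Python sketch. It is an assumption about one possible approach, not the plugin's code: the function name is hypothetical and the sketch simply keeps the first row seen for each phash_grouping value.

```python
def one_hit_per_page(rows: list[dict]) -> list[dict]:
    """Keep only the first result row for each phash_grouping value."""
    seen: set[int] = set()
    unique_rows = []
    for row in rows:
        if row["phash_grouping"] not in seen:
            seen.add(row["phash_grouping"])
            unique_rows.append(row)
    return unique_rows


rows = [
    {"phash": 1, "phash_grouping": 100},  # page indexed without login
    {"phash": 2, "phash_grouping": 100},  # same page, indexed with login
    {"phash": 3, "phash_grouping": 200},
]
print([row["phash"] for row in one_hit_per_page(rows)])
```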

item_mtime 

Field
item_mtime
Description

Modification time:

For TYPO3 pages: the SYS_LASTCHANGED value

For external media: The filemtime() value.

Depending on config, if mtime hasn't changed compared to this value the file/page is not indexed again.

tstamp 

Field
tstamp
Description

Time stamp of the indexing operation. You can configure min/max ages which are checked against this timestamp.

A min-age defines how long a page must have been in the index before re-indexing is considered.

A max-age defines an absolute point at which re-indexing will occur (unless the content has not changed according to an md5-hash)
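How item_mtime, tstamp and the two ages might interact is sketched below in Python. This is one plausible reading of the field descriptions, not the indexer's actual decision logic. The function name, the parameter names and the order of the checks are assumptions. A positive answer only means that re-indexing is attempted; the content hash comparison can still skip the update afterwards.

```python
from typing import Optional


def should_attempt_reindex(now: int, row: Optional[dict], mtime: int,
                           min_age: int = 0, max_age: int = 0) -> bool:
    """Decide whether a page or file is a candidate for re-indexing."""
    if row is None:
        return True  # never indexed before
    age = now - row["tstamp"]
    if max_age and age > max_age:
        return True  # max-age exceeded: re-index regardless of mtime
    if min_age and age <= min_age:
        return False  # too recently indexed to be reconsidered
    if not mtime:
        return True  # no modification time known, so index to be safe
    return mtime != row["item_mtime"]  # only if the item has changed


row = {"tstamp": 1000, "item_mtime": 500}
print(should_attempt_reindex(2000, row, 500))               # False
print(should_attempt_reindex(2000, row, 600))               # True
print(should_attempt_reindex(2000, row, 600, min_age=5000)) # False
```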

static_page_arguments 

Field
static_page_arguments
Description

The Static Page Arguments: URL parameters that are used for caching.

For TYPO3 pages: These are used to re-generate the actual url of the TYPO3 page in question

For files this is an empty array. Not used.

item_type 

Field
item_type
Description

An integer indicating the content type,

0 is TYPO3 pages

1 and above are external files, such as html (1), pdf (2), doc (3), txt (4) and so on. See the class.indexer.php file

item_title 

Field
item_title
Description

Title:

For TYPO3 pages, the page title

For files, the basename of the file (no path)

item_description 

Field
item_description
Description
Short description of the item. Top information on the page. Used in search result.

data_page_id 

Field
data_page_id
Description
For TYPO3 pages: The id

data_page_type 

Field
data_page_type
Description
For TYPO3 pages: The type

data_filename 

Field
data_filename
Description
For external files: The filepath (relative) or URL (not used yet)

contentHash 

Field
contentHash
Description
md5 hash of the content indexed. Before reindexing this is compared with the content to be indexed and if it matches there is obviously no need for reindexing.

crdate 

Field
crdate
Description
The creation date of the INDEXING - not the page/file! (see item_crdate)

parsetime 

Field
parsetime
Description
The parsetime of the indexing operation.

sys_language_uid 

Field
sys_language_uid
Description
Contains the value of $GLOBALS["TSFE"]->sys_language_uid, which tells us the language of the page indexed.

item_crdate 

Field
item_crdate
Description
The creation date. For files only the modification date can be read from the files, so here it will be the filemtime().

gr_list 

Field
gr_list
Description
Contains the gr_list of the user initiating the indexing of the document.

index_section 

Points out the section where an entry in index_phash belongs.

phash 

Field
phash
Description
The phash of the indexed document.

phash_t3 

Field
phash_t3
Description

The phash of the "parent" TYPO3 page of the indexed document.

If the "document" being indexed is a TYPO3 page, then phash and phash_t3 are the same.

But if the document is an external file (PDF, Word etc.) which is found as a LINK on a TYPO3 page, then phash_t3 points to the phash of that TYPO3 page. Indexing normally goes like this: 1) The TYPO3 document is indexed (this has a phash value of course), then 2) if any external files are found on the page, they are indexed as well AND their phash_t3 becomes the phash of the TYPO3 page they were on.

The significance of this value is that indexed external files may have more than one record in "index_section" (with the same phash), a record for each parent page where a link to the document was found! There are details about this in the section of this document that describes the complexities of indexing pages.

rl0 

Field
rl0
Description
The id of the root-page of the site.

rl1 

Field
rl1
Description
The id of the level-1 page (if any) of the indexed page.

rl2 

Field
rl2
Description
The id of the level-2 page (if any) of the indexed page.

page_id 

Field
page_id
Description
The page id of the indexed page.

uniqid 

Field
uniqid
Description
This is just an auto-incremented, unique primary key. Generally not used (I think)

index_fulltext 

For free text searching, e.g. with a sentence, in all content: title, description, keywords, body.

This table is used when basic.useMysqlFulltext extension configuration is enabled.

phash 

Field
phash
Description
The phash of the indexed document.

fulltextdata 

Field
fulltextdata
Description
The total content, stripped of any HTML code.

index_grlist 

This table holds records related to a phash-row. Records in this table confirm that certain gr_lists actually share the same content as represented by the phash-row, even though the phash-row may have been indexed under another login. The table is used during result display to positively confirm that the current user may see the resume (which otherwise might contain secret info). Please see the discussion of access restricted pages above.

index_words, index_rel 

Words-table and word-relation table. Almost self-explanatory.

Neither table is used when the basic.useMysqlFulltext extension configuration is enabled.

For the index_rel table some fields require explanation:

count 

Field
count
Description
Number of occurrences on the page

first 

Field
first
Description
How close to the top (low number is better)

freq 

Field
freq
Description
Frequency (please see source for the calculations. This is converted from some floating point to an integer)

flags 

Field
flags
Description

Bits that describe the weight of the words:

8th bit (128) = word found in title,

7th bit (64) = word found in keywords,

6th bit (32) = word found in description,

Last 5 bits are not used yet, but if used they will enter the weight hierarchy. The result rows are ordered by this value if the "Weight/Frequency" sorting is selected. Thus results with a hit in the title, keywords or description are ranked higher in the result list.
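The bit layout can be decoded with a few bitwise operations. The Python sketch below uses the three bit values listed above; the constant names, the function name and the sample rows are chosen for this example only.

```python
FLAG_TITLE = 128       # 8th bit: word found in title
FLAG_KEYWORDS = 64     # 7th bit: word found in keywords
FLAG_DESCRIPTION = 32  # 6th bit: word found in description


def describe_flags(flags: int) -> list[str]:
    """List the places a word was found in, according to the flag bits."""
    places = []
    if flags & FLAG_TITLE:
        places.append("title")
    if flags & FLAG_KEYWORDS:
        places.append("keywords")
    if flags & FLAG_DESCRIPTION:
        places.append("description")
    return places


# A word found in both title and keywords has flags 128 + 64 = 192.
print(describe_flags(192))

# Sorting by the flags value puts title hits before keyword-only hits,
# which in turn come before body-only hits (flags 0).
rows = [{"phash": 1, "flags": 0}, {"phash": 2, "flags": 128},
        {"phash": 3, "flags": 64}]
print(sorted(rows, key=lambda row: row["flags"], reverse=True))
```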

Known problems 

Searching for hy-phen-at-ed words 

When using the fulltext index feature, searching for words with hyphens in them ("Berners-Lee") will yield no results when MySQL is used as database server. MariaDB does not have this problem.

The reason for this behavior is that the MySQL fulltext parser indexes words with hyphens as two words: "Berners Lee".

Another problem is that the "fulltext search minimum word length" setting ft_min_word_len default value is 4, which means that three-letter words are not indexed at all. Of "Berners-Lee", only "Berners" will be in the index.
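The combined effect of the two problems can be imitated in a few lines of Python. This is a rough imitation of the behavior described above, not MySQL's actual fulltext parser: the splitting pattern and the function name are assumptions, and only the minimum length of 4 is taken from the text.

```python
import re


def indexed_tokens(text: str, min_word_len: int = 4) -> list[str]:
    """Split on non-word characters and drop words that are too short."""
    words = re.split(r"[^A-Za-z0-9_']+", text)
    return [word for word in words if len(word) >= min_word_len]


# The hyphen splits the name in two and "Lee" is shorter than 4 characters.
print(indexed_tokens("Berners-Lee"))
```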
