.. You may want to use the usual include line. Uncomment and adjust the path.
.. include:: ../Includes.txt

.. role:: underline

================
EXT: mnoGoSearch
================

:Created: 2008-11-01T07:51:37
:Changed by: Dmitry Dulepov
:Changed: 2009-04-16T14:00:25
:Author: Dmitry Dulepov
:Email: dmitry@typo3.org
:Info 3:
:Info 4:

|img-1| |img-2|

.. _EXT-mnoGoSearch:

EXT: mnoGoSearch
================

Extension Key: **mnogosearch**

Copyright 2004-2009, Dmitry Dulepov,

This document is published under the Open Content License available from http://www.opencontent.org/opl.shtml

The content of this document is related to TYPO3 \- a GNU/GPL CMS/Framework available from www.typo3.org

.. _Table-of-Contents:

Table of Contents
-----------------

- Introduction

  - What does it do?
  - Screenshots

    - Search results in the Frontend
    - Configuring what to index

  - Requirements
  - Support for this extension
  - Translations
  - Bugs

- Users manual

  - Specifying web space to search

    - How mnoGoSearch decides what to index
    - Specifying pages to index
    - Excluding parts of the web site from indexing
    - Indexing only real content
    - Indexing records
    - Indexing files
    - Indexing large file collections
    - Indexing https pages

  - Creating search form

    - Using TypoScript
    - Using page module
    - Using HTML

  - Creating advanced search form
  - Creating page with search results

    - Plugin mode
    - Limiting search to a certain web space

- Administration

  - Compiling and installing search engine
  - Compiling and installing PHP extension
  - Creating index database
  - Using mnogosearch binary and extension supplied with operating system
  - Adding cron job
  - Installing TYPO3 extension
  - Configuring Frontend plugin using TypoScript
  - Using Google Analytics to track your searches
  - FAQ

    - TYPO3SEARCH\_xxx comments are not respected. What is wrong?
    - I experimented and messed up my index. How do I clear it?
    - I removed a page. How do I remove it from index?
    - I receive a error “Got error 139 from the database engine” while indexing
    - There seems to be a clone of mnoGoSearch called DataParkSearch. What is it?
    - What does “mnoGoSearch” mean?

- Configuration

  - TypoScript reference

    - ->FORM
    - -> ADVANCED
    - -> SELECTOR
    - ->SEARCH

  - Command line tool parameters

- Tutorial
- Known problems
- To-Do list
- ChangeLog

.. _Introduction:

Introduction
------------

.. _What-does-it-do:

What does it do?
^^^^^^^^^^^^^^^^

This extension provides an alternative search engine for TYPO3. It features high performance, relevancy ranking, a true crawler, searching for word forms (go/goes, man/men), clone detection, a suggest mode for misspelled words, great scalability and a Google-like look. The extension can be configured to index and search pages, records and files. When searching thousands of pages, the performance of this extension is much better than that of any other existing TYPO3 search solution known to the author of the extension.

This extension requires external software to be installed on the server. The software can be downloaded from the http://www.mnogosearch.org/ web site. It is the search engine that works behind this extension and provides the indexing and searching services. Additionally, the mnoGoSearch PHP module is required. This manual contains instructions for building the search engine and the PHP extension. Building can be performed even by inexperienced users if they follow the instructions exactly.

In general, the mnoGoSearch extension outperforms the standard indexed search extension. It is much faster and more feature rich. It has all the features of indexed search but is much more efficient.

.. _Screenshots:

Screenshots
^^^^^^^^^^^

This section shows what mnoGoSearch looks like in action. The screenshots in this section come from different sites, therefore the visual styling also differs.

.. _Search-results-in-the-Frontend:

Search results in the Frontend
""""""""""""""""""""""""""""""

The following screenshot shows search results. Notice the file type icon in the first result (an OpenOffice document), the relevancy indicator (green bar), the size and the last modification date. The second result did not provide a last modification date, so none is displayed for it.

|img-3|

The extension uses a rich page browser to allow better navigation. The page browser can be customized to show as many page links as necessary:

|img-4|

.. _img-4-Configuring-what-to-index:

Configuring what to index
"""""""""""""""""""""""""

The following screenshot shows the Backend configuration of the web space to be indexed. It says that the whole web site should be indexed:

|img-5|

Next, parts of the web site are prohibited from being indexed. These pages contain news and FAQ items. We will index them differently.

|img-6|

Finally, we index FAQ items and news. Here is what the indexing configuration for news looks like:

|img-7|

The reasons to index news, FAQ and some other records like this will be explained later in this manual.

.. _Requirements:

Requirements
^^^^^^^^^^^^

The mnoGoSearch extension **does not work on Windows servers** because the corresponding PHP extension is not available for Windows. It works fine on Linux, Unix, FreeBSD and Mac OS X servers.

RealURL or CoolURI is necessary if some parts of the site have to be excluded from search. See “Specifying web space to search” for more information.

To compile the search engine and the PHP extension, :code:`gcc` and the accompanying GNU build tools must be installed on the server.

.. _Support-for-this-extension:

Support for this extension
^^^^^^^^^^^^^^^^^^^^^^^^^^

Free support for this extension is available through the TYPO3 mailing lists. The author does not provide free support by e-mail. Commercial support is available on request when time permits.

.. _Translations:

Translations
^^^^^^^^^^^^

Translation of this extension happens only through the TYPO3 translation server. Please do not send translations to the author as they will not be accepted. Instead, contact the TYPO3 translators using the corresponding TYPO3 mailing list.

.. _Bugs:

Bugs
^^^^

Bugs must be reported only through the tracker at http://forge.typo3.org/projects/extension-mnogosearch/issues. Bugs must not be sent by e-mail because such e-mails are not processed.

.. _Users-manual:

Users manual
------------

This section describes what end users should do to enable searching web pages using mnoGoSearch. If you are looking for a “Quick start”-like guide, you should check the “Tutorial” section first. It describes the workflow to get mnoGoSearch up and running quickly. This section describes the various options for searching pages.

.. _Specifying-web-space-to-search:

Specifying web space to search
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section describes how to specify what the extension will search and index. Often the whole web site can be indexed, but sometimes certain parts of the web site should not be indexed or should be indexed in a more efficient manner than just indexing pages.
This section explains how to do it all.

.. _How-mnoGoSearch-decides-what-to-index:

How mnoGoSearch decides what to index
"""""""""""""""""""""""""""""""""""""

mnoGoSearch sees web sites as a hierarchical structure. When indexing, it needs to know where to start. Typically the start of the hierarchy is the root of the web site (like :code:`http://example.com/`). If necessary, there can be many starting points (like :code:`http://example.com/products/` and :code:`http://example.com/services/`). In this case search will be limited to the corresponding starting points and everything below them (for example, :code:`http://example.com/products/navigation/`). Any pages outside of the configured starting points are not indexed and therefore not searchable.

The important point in the information above is that a web site can be indexed as a whole (:code:`http://example.com/`) or in parts. When indexing in parts, site URLs should be hierarchical, which implies the use of RealURL or CoolURI.

When the whole site is indexed, some pages may still need to be excluded. mnoGoSearch provides a way to disallow certain pages from indexing. For a single page, this can be accomplished with the :code:`No search` checkbox in the page properties. When multiple pages starting from a certain page should not be indexed (like checkout pages), mnoGoSearch allows disabling whole hierarchies by specifying the path to the hierarchy.

.. _Specifying-pages-to-index:

Specifying pages to index
"""""""""""""""""""""""""

To specify pages for indexing, an indexing configuration record should be created. While creating these records, it is important to keep in mind that mnoGoSearch works with URL path hierarchies.

The first step in specifying pages for indexing is to choose where the indexing records are stored. Typically it will be the web site home page or a storage folder. It does not make much difference. However, it is good to be consistent and keep all indexing records for a web site on a single page.
It makes it easy to see what is actually indexed and what is excluded from indexing.

To create an indexing configuration record, navigate to the page and use the List module to create the record. By default records are of the type “Server”. This is the simplest possible type. It specifies the indexing starting point as a path within the web site. For example, to index the full web site, it should be :code:`http://example.com/` (assuming that :code:`example.com` is your web site domain). Note the trailing slash: it is necessary if the URL does not include any other path. Here is what such an indexing record looks like:

|img-8|

Additional options include the indexing period (24 hours is the default) and “Additional indexing configuration”. The latter allows entering mnoGoSearch configuration directives directly. They will be appended to the generated indexer configuration. Information about directives can be found at http://mnogosearch.org/doc33/. Notice that this field is not validated and any wrong directive will result in a fatal error during indexing.

The next type of indexing record is a “Realm”. A Realm is very similar to a “Server” but it allows regular expressions or wildcards to be used to specify paths. For example, one can enter :code:`http://example.com/(news|faq)/.*` as a path. Make sure that the correct comparison type is specified:

|img-9|

.. _Excluding-parts-of-the-web-site-from-indexing:

Excluding parts of the web site from indexing
"""""""""""""""""""""""""""""""""""""""""""""

To exclude parts of the web site from indexing, create an indexing configuration record as described above but set the method to “Disallow”. It will prohibit any pages starting from the given path from being indexed. See the screenshot above (the “Realm” record).

**Notice** that such records should appear in the List module *before* records that enable site indexing. The first matching record takes precedence when matching URLs. For example, consider :code:`http://example.com/page/?excludeMe=1`.
This order is correct:

- Disallow: :code:`*?excludeMe=*`
- Allow: :code:`http://example.com/`

This checks the “disallow” rule first. If it matches, it is used. It means that :code:`http://example.com/page/` will be indexed but :code:`http://example.com/page/?excludeMe=1` will not.

However, consider these rules:

- Allow: :code:`http://example.com/`
- Disallow: :code:`*?excludeMe=*`

Now both URLs will be indexed because both match the first rule, so the “disallow” rule never takes effect. So when disallowing some pages from being indexed, always put the disallow rule *before* the rule that allows indexing.

.. _Indexing-only-real-content:

Indexing only real content
""""""""""""""""""""""""""

To improve search relevancy, some parts of the page should be excluded from indexing. Such parts include navigation (menu), logo, copyright, statistics, partner links, etc. Typically only the real content should be included in the index.

Special HTML comments can be added to the page to tell the indexer which parts of the page should be indexed. There can be many such markers on a single page. Here is an HTML fragment that illustrates how to add such markers: ::

   My site
   <!--TYPO3SEARCH_begin-->
   Here goes real web site content...
   <!--TYPO3SEARCH_end-->
   <!--TYPO3SEARCH_begin-->
   Here goes another content block...
   <!--TYPO3SEARCH_end-->
   Copyright © My company.

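
The effect of such markers can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used by the extension or by the indexer. The function name ``indexable_blocks`` and the sample page are assumptions made for this example, and the marker spelling follows the usual TYPO3 convention.

.. code-block:: python

   import re

   # Marker comments, spelled without spaces or line breaks.
   BEGIN = "<!--TYPO3SEARCH_begin-->"
   END = "<!--TYPO3SEARCH_end-->"


   def indexable_blocks(html):
       """Return the text of every block enclosed by a begin/end marker pair."""
       pattern = re.escape(BEGIN) + r"(.*?)" + re.escape(END)
       return [block.strip() for block in re.findall(pattern, html, flags=re.DOTALL)]


   # Hypothetical page with two marked content blocks.
   page = """My site
   <!--TYPO3SEARCH_begin-->
   Here goes real web site content...
   <!--TYPO3SEARCH_end-->
   <!--TYPO3SEARCH_begin-->
   Here goes another content block...
   <!--TYPO3SEARCH_end-->
   Copyright © My company.
   """

   blocks = indexable_blocks(page)
   for block in blocks:
       print(block)

Only the two marked blocks are returned. The site title and the copyright line lie outside the markers and are skipped. A marker written with extra spaces, such as ``<!-- TYPO3SEARCH_begin -->``, is not recognized by this sketch, which mirrors the exact-spelling requirement for the comments.
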
In the HTML fragment above, content between the TYPO3SEARCH\_xxx comments will be *indexed*, and all links outside of these comments will still be *followed* (added to the indexer queue).

Notice that there must be no spaces or line breaks in these comments. They must be spelled exactly as shown in the example above.

Note that TemplaVoila creates such markers automatically. Other templating engines do not.

.. _Indexing-records:

Indexing records
""""""""""""""""

In certain cases indexing content as pages is not efficient. For example, it is more efficient to index news records as records than as pages. Indexing news as pages adds more content than necessary to the index, increases the load on the web server and lowers search relevance. When indexing news items as records, mnoGoSearch indexes only the title and text fields. Thus only the true news text is searchable. The same applies to the FAQ (extension :code:`irfaq`) and some other extensions that store information as records.

To index records, an indexing configuration for them should be created. Navigate to the page of the web site you have chosen to store indexing configurations on. Then create an indexing configuration record and set its type to “Records”. Next, choose the table you want to index. The form will refresh. Here is what it looks like if the “News” table from the :code:`tt_news` extension is chosen:

|img-7|

The form requires the title and text fields of the record to be selected. There must be one title field and one or more text fields to index. Text fields will be concatenated during indexing. Notice that no conversion is done on the fields. Thus using “Archive date” in the form above will not be useful because this field is stored as an integer value in the database. Only true text fields should be selected.

The next parameter to specify is the URL parameters for the item's single view. For most extensions it looks like :code:`&tx_extkey_pi1[showUid]={field:uid}`.
For :code:`tt_news` it looks as shown on the screenshot above. The :code:`&` symbol at the beginning of the parameter is mandatory. :code:`{field:uid}` is replaced with the uid of the record. No other substitutions are available.

It is possible to limit indexing to records from a certain storage folder. This way, for example, only the news records of the web site will be indexed and not any imported news in another sysfolder.

.. _Indexing-files:

Indexing files
""""""""""""""

Indexing files is possible in the same way as indexing pages. Specify the correct path to the files (:code:`http://example.com/fileadmin/` and :code:`http://example.com/uploads/`) to allow indexing them. The rest is done automatically. Directories must show an index of the files in them (use Apache :code:`mod_autoindex`).

To index files successfully, you must ensure that the file parsing applications (like :code:`catdoc` or :code:`pdftotext`) are installed on the server in the default places (normally :code:`/usr/bin`).

Currently mnoGoSearch supports indexing of the following file types:

.. ### BEGIN~OF~TABLE ###

.. _sxw:

sxw
~~~

.. container:: table-row

   Extension
         sxw odt

   Mime type
         application/vnd.oasis.opendocument.text

   Requires applications
         unzip

   Description
         OpenOffice document, requires :code:`unzip` to be in the current
         execution path

.. _doc:

doc
~~~

.. container:: table-row

   Extension
         doc

   Mime type
         application/msword

   Requires applications
         catdoc

   Description
         Microsoft Office document

.. _xls:

xls
~~~

.. container:: table-row

   Extension
         xls

   Mime type
         application/vnd.ms-excel

   Requires applications
         xmltohtml

.. _ppt:

ppt
~~~

.. container:: table-row

   Extension
         ppt

   Mime type
         application/vnd.ms-powerpoint

   Requires applications
         pptohtml

.. _pdf:

pdf
~~~

.. container:: table-row

   Extension
         pdf

   Mime type
         application/pdf

   Requires applications
         pdftotext

   Description
         Adobe PDF

.. _txt:

txt
~~~

.. container:: table-row

   Extension
         txt

   Mime type
         text/plain

   Requires applications
         none

   Description
         Plain text

.. _html:

html
~~~~

.. container:: table-row

   Extension
         html

   Mime type
         text/html

   Requires applications
         none

   Description
         HTML

.. ###### END~OF~TABLE ######

Web servers must be configured to return the correct mime type when a file is downloaded. With Apache, use the :code:`AddType` directive to add a mime type: ::

   AddType application/vnd.oasis.opendocument.text .sxw
   AddType application/vnd.oasis.opendocument.text .odt

.. _Indexing-large-file-collections:

Indexing large file collections
"""""""""""""""""""""""""""""""

If the number of files is large, it does not make sense to fetch them all using HTTP. In this case an additional directive should be added to the “Additional configuration” field of the indexing configuration for files. This directive forces the indexer to access files locally instead of fetching them through HTTP. Assuming that files are available at :code:`http://example.com/fileadmin/fileserver/` and are physically located at :code:`/path/to/fileadmin/fileserver/`, the following directive should be added: ::

   Alias http://example.com/fileadmin/fileserver/ file:///path/to/fileadmin/fileserver/

Notice the correct number of slashes in the paths.

.. _Indexing-https-pages:

Indexing https pages
""""""""""""""""""""

Indexing https pages with self-signed certificates is not possible directly. The mnoGoSearch indexer will refuse to index such pages because it will not see the certificate as valid. If obtaining a valid certificate is not an option, there is another way to index such pages. For that, a utility named “curl” should be installed on the server.

First, navigate to the web site root and execute the following command: ::

   php typo3/cli_dispatch.phpsh mnogosearch -d | grep X-TYPO3

It will produce output similar to: ::

   HTTPHeader "X-TYPO3-mnogosearch: d3e203fdb699f7ba6ad7396fdba5c25a"

Note the part in quotes. Next, create a new file named :code:`curl.sh` somewhere in the file system. If many sites run on the same host, it makes sense to put this file inside the web site space.
Put the following content into this file: ::

   #!/bin/sh
   curl -i -k -H "X-TYPO3-mnogosearch: d3e203fdb699f7ba6ad7396fdba5c25a" "$1" 2>/dev/null

Note the part in quotes after :code:`-H`: it is taken from the output of the previous command. **Do not copy this example!** The header is unique for each site, even for sites running on the same server!

The :code:`-H` option adds a special HTTP header to the HTTP request. This header tells the extension that the indexer is running. The extension will then exclude all content outside of the TYPO3SEARCH\_xxx markers from the indexed data. See the “Indexing only real content” chapter for more information.

This script will fetch https pages even if the certificate is self-signed.

Make this file accessible and executable for the current user only: ::

   chmod 0700 /path/to/curl.sh

**Warning!** Setting permissions like this is extremely important! The web server must be able neither to read this file nor to execute it. If permissions are not set correctly, the security of the web site will be compromised!

Next, add the following line to the “Additional configuration” of the first indexing configuration you have: ::

   Alias https:// exec:/path/to/curl.sh?https://

This will call the script to fetch pages for the :code:`https://` scheme. Now https pages with self-signed certificates can be indexed too. Make sure that :code:`/path/to/curl.sh` points to the script. :code:`/path/to` above is a placeholder for the real path.

.. _Creating-search-form:

Creating search form
^^^^^^^^^^^^^^^^^^^^

There are various ways to add a search form to the page. You can use one or more of them. If the search box appears on each page, TypoScript will work best. However, two other options are also available.

.. _Using-TypoScript:

Using TypoScript
""""""""""""""""

To create a short simple form (equivalent to “macina\_searchbox” for indexed search) on each page, do the following in TypoScript: ::

   lib.search_form < plugin.tx_mnogosearch_pi1
   lib.search_form.mode = short_form

Now :code:`lib.search_form` can be used for replacing a marker or as a TemplaVoila object. You can also change other options (like the template file). See the “Configuration” section later in this manual.

Notice that by default the form is not cacheable. A non-cacheable form will show the search terms in the search box after submission. This may slightly decrease web site performance. To avoid this, the following line can be added after the two lines shown above: ::

   lib.search_form = USER

See also “Using HTML” below for even better performance when using search forms.

.. _Using-page-module:

Using page module
"""""""""""""""""

When using the Page module, insert the mnoGoSearch plugin into the page and select the desired form in the plugin properties:

|img-10|

.. _img-10-Using-HTML:

Using HTML
""""""""""

Instead of using the plugin or TypoScript, it is possible to have the form directly in the web site template. The following is the minimum required form markup for mnoGoSearch. Note that it needs a proper “action” URL. If you plan to use Google Analytics to track search results, the method must be “get”. Otherwise it can be “post”. Here is the HTML: ::
**This method is recommended** for better web site performance. The plugin for this extension is defined as USER\_INT, which means that the plugin is never cached. Having this plugin on every page may cause slightly lower web site performance.

Notice that when the form is placed directly in HTML, its search field will be empty after submission.

.. _Creating-advanced-search-form:

Creating advanced search form
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Creating an advanced search form needs a little more work than creating a simple search form. Currently the advanced search form displays only one additional field. This field allows selecting which part of the web site to search. This field is hidden by default in the configuration until version 2.1.8, when it will become enabled by default.

To enable the advanced search form, the administrator should define more than one indexing configuration for the web site. For example, they can define configurations like “Everywhere” (:code:`http://example.com/`), “News only” (table: “News”) and “FAQ only” (table: “FAQ”). Next, these configurations should be added to the search limit field in the plugin's flexform configuration, or their ID values should be added to the TypoScript property named “siteLimits”. Finally, the selector should be enabled in TypoScript: ::

   lib.advanced_form < plugin.tx_mnogosearch_pi1
   lib.advanced_form.form.advanced {
       siteSelector = select
   }

This code will render the selector as an HTML :code:`