.. You may want to use the usual include line. Uncomment and adjust the path.
   .. include:: ../Includes.txt

.. role:: underline

================================================
EXT: marita - A Zend Lucene based search indexer
================================================

:Author:
      Kasper Skårhøj

:Created:
      2002-11-01T00:32:00

:Changed by:
      Michael Fritz

:Changed:
      2010-02-05T20:09:40.900000000

:Classification:
      marita

:Keywords:
      indexer, search, lucene, marita, marit ag, michael fritz

:Author:
      Michael Fritz

:Email:
      michael.fritz marit.ag

:Info 4:

:Language:
      en

|img-1| |img-2|

.. _EXT-marita-A-Zend-Lucene-based-search-indexer:

EXT: marita - A Zend Lucene based search indexer
================================================

Extension Key: marita

Language: en

Keywords: indexer, search, lucene, marita, marit ag, michael fritz

Copyright 2010-20xx, Michael Fritz

This document is published under the Open Content License
available from http://www.opencontent.org/opl.shtml

The content of this document is related to TYPO3
\- a GNU/GPL CMS/Framework available from www.typo3.org

.. _Table-of-Contents:

Table of Contents
-----------------

- Introduction

  - What does it do?
  - Screenshots

- Users manual

  - FAQ

- Administration

  - FAQ

- Configuration

  - Reference

- Known problems
- ChangeLog

.. _Introduction:

Introduction
------------

.. _What-does-it-do:

What does it do?
^^^^^^^^^^^^^^^^

This extension was developed by the TYPO3 agency Marit AG. It is an alternative to the Solr search solution that can be used without a JSP server and without any other additional technology.

- It is a search engine for TYPO3 or other technologies, based on PHP Zend Lucene.
- It crawls the visible pages of one or more defined websites: all content and PDF files that can be seen by the public.
- Non-TYPO3 pages can also be indexed, as long as they are XHTML (trans\|strict) and UTF-8 encoded.
- Search result pages are weighted by the relevance of keywords.
- Search requests are extremely fast and are delivered via an Ajax-like interface.


.. _Screenshots:

Screenshots
^^^^^^^^^^^

An example, although it is not a TYPO3 site: http://www.ct-arzneimittel.de/

|img-3|

|img-4|

A record to set up a domain to crawl.

.. _Users-manual:

Users manual
------------

- Install the extension.

- Set up your domain to crawl as a record somewhere in a sysfolder.

- Set up a backend user with the name \_cli\_marita. No special rights are required, and any password will do.

  |img-5|

- Test the crawler via **php /html/typo3/cli\_dispatch.phpsh marita run** (on Mittwald servers, it is php\_cli instead of php).

  |img-6|

- Now you can follow the indexer by looking at the following directories:

  - /html/typo3conf/ext/marita/cli/lucene/index (the Lucene index)
  - /html/typo3conf/ext/marita/cli/lucene/log (log files)

  If you see a lot of similar pages in your log file, there is probably something wrong with your link structure. You should fix this, because the crawler works like any other search engine crawler, so duplicated content and URLs also hurt your SEO. (There is a well-known tt\_news issue with backPID, which is not recommended.) The crawler tries to find similar pages and merge them, but it cannot do so when the pages differ slightly.

  After the crawler has finished its job, the index folder is renamed from index.myrepository.new to index.myrepository. The old folder index.myrepository, if it exists, is deleted. If you stopped the crawler before it finished, you can rename the folder manually in order to have some test results.

- Add the extension's TypoScript to your template.

- Now put the search form on your page.

- Search results should now be shown on your page.

  |img-7|

.. _FAQ:

FAQ
^^^

Can I change the layout of the search form? Yes, just override lib.marita, which can be found at /html/typo3conf/ext/marita/static/lib.marita/setup.txt

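
The crawler test run from the steps above can be wrapped in a small shell script. The following is only a minimal sketch and not part of the extension: the function names marita\_paths and run\_marita\_crawl are hypothetical, and the /html web root and the php binary are simply the defaults used in this manual (pass php\_cli as second argument on Mittwald servers).

.. code-block:: shell

   #!/bin/sh
   # Sketch only: helper names are hypothetical, paths follow this manual.

   # Print the Lucene index and log directories for a given web root.
   marita_paths() {
       root="${1:-/html}"
       base="$root/typo3conf/ext/marita/cli/lucene"
       printf '%s\n' "$base/index" "$base/log"
   }

   # Run the crawler once; fail early if the CLI dispatcher is missing.
   run_marita_crawl() {
       root="${1:-/html}"
       php_bin="${2:-php}"    # php_cli on Mittwald servers
       dispatcher="$root/typo3/cli_dispatch.phpsh"
       if [ ! -f "$dispatcher" ]; then
           echo "CLI dispatcher not found: $dispatcher" >&2
           return 1
       fi
       "$php_bin" "$dispatcher" marita run
   }

After sourcing the script, run\_marita\_crawl /html php\_cli starts one crawl, and marita\_paths prints the two directories to watch while it runs.
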

.. _Administration:

Administration
--------------

- You should set up a cron job to let your server crawl once a day. The crawler takes about one hour per 500 pages:

  0 0 \* \* \* php /html/typo3/cli\_dispatch.phpsh marita run

- The extension frontend uses jQuery, but only for a few selectors, so you could easily replace it with a different framework. The file is /html/typo3conf/ext/marita/res/js/jSearch.js

- The template of the search frame uses Smarty. The template can be found at /html/typo3conf/ext/marita/cli/view/tempaltes/

- You can retrieve the index results with a different frontend function by querying the following eID feature: :underline:`?eID=marita&searchword=mysearchterm&lang=de&ajax=1&domain=myrepositoryname`

- If you are using multiple indices, you can query a different repository by changing the constant marita.domain or by changing the following part of the query string: :underline:`?eID=marita&searchword=mysearchterm&lang=de&ajax=1&domain=`\ **myrepositoryname**

- You can improve your search results by defining the following PHP constants:

  | define('PARSED\_AREA\_CSSID', 'myDivContainerID');
  | define('NUMBER\_OF\_PREVIEWCHARS', 200);
  | define('SEARCH\_LIMIT', 100);

.. _FAQ:

FAQ
^^^

- cURL is required.
- Zend Lucene is required, but included.
- Smarty is required, but included.

.. _Configuration:

Configuration
-------------

.. _Reference:

Reference
^^^^^^^^^

Reference (TypoScript constants).

.. ### BEGIN~OF~TABLE ###

.. _marita-lang:

marita.lang
"""""""""""

.. container:: table-row

   Property
         marita.lang

   Data type
         String

   Description
         The language to use. Select one of the languages defined in the file
         /html/typo3conf/ext/marita/cli/lang.php

   Default
         de

.. _marita-domain:

marita.domain
"""""""""""""

.. container:: table-row

   Property
         marita.domain

   Data type
         String

   Description
         The repository to use

   Default
         Marit

.. _marita-searchfieldtext:

marita.searchfieldtext
""""""""""""""""""""""

.. container:: table-row

   Property
         marita.searchfieldtext

   Data type
         String

   Description
         Button text

   Default
         Suchen

.. ###### END~OF~TABLE ######

Record field explanation

.. ### BEGIN~OF~TABLE ###

.. _RepositoryName:

RepositoryName
""""""""""""""

.. container:: table-row

   Property
         RepositoryName

   Data type
         String

   Description
         Just use any name you wish

   Example
         myrepository

.. _URL:

URL
"""

.. container:: table-row

   Property
         URL

   Data type
         String

   Description
         The point where the crawler starts to crawl

   Example
         http://www.marit.ag

.. _Pattern:

Pattern
"""""""

.. container:: table-row

   Property
         Pattern

   Data type
         String

   Description
         The string that a crawled URL has to match. URLs that do not match
         are neither indexed nor followed.

   Example
         marit.ag

.. _Exeptions:

Exeptions
"""""""""

.. container:: table-row

   Property
         Exeptions

   Data type
         String

   Description
         Comma-separated strings that are exceptions to the pattern

   Example
         blog.marit.ag

.. _Depth-in-levels:

Depth in levels
"""""""""""""""

.. container:: table-row

   Property
         Depth in levels

   Data type
         Int

   Description
         How many link levels are followed to look for more pages. I use it to
         avoid loops or infinite link structures (like calendar day views back
         to 1920 :-) ). Increase this value to get more pages crawled (99
         worked for me)!

   Example
         5

.. _Spellcheck:

Spellcheck
""""""""""

.. container:: table-row

   Property
         Spellcheck

   Data type
         Boolean

   Description
         Check it to get a “did you mean” function. This function is still
         very much alpha and not really tested yet.

   Example
         Checked

.. ###### END~OF~TABLE ######

.. _Known-problems:

Known problems
--------------

And of course Todos at the same time:

- The indexer works only with UTF-8 and XHTML (trans\|strict).
- Not all settings can be made via TypoScript.
- Works only with TYPO3 Version <4.2 (eID, CLI)
- If the link structure of your website is not clean, some pages will be shown multiple times.
- Login-restricted pages cannot be indexed at the moment (got an idea how? Please mail me!).
- Nofollow links should be excluded.
- Frontend output is not templated yet.
- Word and PPT files cannot be indexed.

.. _ChangeLog:

ChangeLog
---------

- Initial version
- Some improvements
- Manual added

.. ######CUTTER_MARK_IMAGES######

.. |img-1| image:: img-1.png
.. :align: left

.. |img-2| image:: img-2.png
.. :border: 0
.. :height: 21
.. :hspace: 9
.. :id: Grafik2
.. :name: Grafik2
.. :width: 87

.. |img-3| image:: img-3.png
.. :align: left
.. :border: 0
.. :height: 279
.. :id: Grafik1
.. :name: Grafik1
.. :width: 454

.. |img-4| image:: img-4.png
.. :align: left
.. :border: 0
.. :height: 197
.. :id: Grafik3
.. :name: Grafik3
.. :width: 269

.. |img-5| image:: img-5.png
.. :align: left
.. :border: 0
.. :height: 373
.. :id: Grafik5
.. :name: Grafik5
.. :width: 655

.. |img-6| image:: img-6.png
.. :align: left
.. :border: 0
.. :height: 61
.. :id: Grafik6
.. :name: Grafik6
.. :width: 504

.. |img-7| image:: img-7.png
.. :align: left
.. :border: 0
.. :height: 234
.. :id: Grafik7
.. :name: Grafik7
.. :width: 420