EXT: marita - A Zend Lucene based search indexer¶
Author: | Kasper Skårhøj |
---|---|
Created: | 2002-11-01T00:32:00 |
Changed by: | Michael Fritz |
Changed: | 2010-02-05T20:09:40.900000000 |
Classification: | marita |
Keywords: | indexer, search, lucene, marita, marit ag, michael fritz |
Author: | Michael Fritz |
Email: | michael.fritz marit.ag |
Info 4: | |
Language: | en |
Extension Key: marita
Language: en
Keywords: indexer, search, lucene, marita, marit ag, michael fritz
Copyright 2010-20xx, Michael Fritz, <michael.fritz marit.ag>
This document is published under the Open Content License
available from http://www.opencontent.org/opl.shtml
The content of this document is related to TYPO3
- a GNU/GPL CMS/Framework available from www.typo3.org
Table of Contents¶
- Introduction
- Users manual
- Administration
- Configuration
- Known problems
- ChangeLog
Introduction¶
What does it do?¶
This extension has been developed by the TYPO3 agency Marit AG. It is an alternative to the Solr search solution that can be used without a JSP server and without any other additional technology.
- It is a search engine for TYPO3 or other technologies, based on Zend Lucene for PHP.
- It crawls the visible pages of one or more defined websites: all content and PDF files that are publicly accessible.
- Non-TYPO3 pages can be indexed as well, as long as they are delivered as XHTML (Transitional or Strict) and UTF-8.
- Search results are weighted by keyword relevance.
- Search requests are extremely fast and are delivered via an Ajax-like interface.
Screenshots¶
An example, although it is not a TYPO3 site: http://www.ct-arzneimittel.de/
A record to set up a domain to crawl.
Users manual¶
- Install the extension.
- Set up the domain to crawl as a record somewhere in a sysfolder.
- Set up a backend user named _cli_marita (no special rights are required, and any password will do).
- Test the crawler with: php /html/typo3/cli_dispatch.phpsh marita run (on Mittwald servers, use php_cli instead of php)
Now you can follow the indexer's progress in the following directories:
- /html/typo3conf/ext/marita/cli/lucene/index (the Lucene index)
- /html/typo3conf/ext/marita/cli/lucene/log (log files)
- If you see a lot of similar pages in your log file, there is probably something wrong with your link structure. You should fix this, because the crawler works like any other search engine crawler, so duplicate content and duplicate URLs also hurt your SEO (a well-known case is the tt_news backPID parameter, which is not recommended). The crawler tries to detect similar pages and merge them, but it cannot do so when the pages differ even slightly.
- After the crawler has finished, the index folder is renamed from index.myrepository.new to index.myrepository. The old index.myrepository folder, if it exists, is deleted. If you stop the crawler before it has finished, you can rename the folder manually to get some test results.
- Add the extension's TypoScript to your template.
- Put the search form on your page.
- Search results should now be shown on your page.
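The index swap that happens when the crawler finishes (see the steps above) can be sketched as follows. This is an illustrative Python sketch, not the extension's own PHP code; the function name `promote_index` and the `index_root` parameter are hypothetical, and only the folder names index.myrepository.new and index.myrepository come from this manual.

```python
import shutil
from pathlib import Path


def promote_index(index_root: Path, repository: str) -> Path:
    """Replace index.<repository> with the freshly built index.<repository>.new.

    index_root is the directory that holds both folders (an assumption; pass
    whatever directory your installation actually uses).
    """
    new_dir = index_root / f"index.{repository}.new"
    live_dir = index_root / f"index.{repository}"

    if not new_dir.is_dir():
        raise FileNotFoundError(f"no new index found at {new_dir}")

    # Delete the previous index, if there is one.
    if live_dir.exists():
        shutil.rmtree(live_dir)

    # Make the new index the live one.
    new_dir.rename(live_dir)
    return live_dir
```

Doing the same by hand after an interrupted crawl amounts to renaming index.myrepository.new to index.myrepository, which gives you a partial index to test with.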
FAQ¶
Can I change the layout of the search form? Yes, just override lib.marita, which is defined in /html/typo3conf/ext/marita/static/lib.marita/setup.txt
Administration¶
- Set up a cron job so that your server crawls once a day (the crawler takes about 1 hour per 500 pages):
  0 0 * * * php /html/typo3/cli_dispatch.phpsh marita run
- The extension frontend uses jQuery, but only for a few selectors, so you could easily replace it with a different framework. The file is: /html/typo3conf/ext/marita/res/js/jSearch.js
- The template of the search frame uses Smarty. The template can be found at: /html/typo3conf/ext/marita/cli/view/templates/
- You can fetch the search results from your own frontend code by querying the following eID endpoint: ?eID=marita&searchword=mysearchterm&lang=de&ajax=1&domain=myrepositoryname
- If you use multiple indices, you can select a different repository by changing the constant marita.domain or by changing the domain parameter in the query string: ?eID=marita&searchword=mysearchterm&lang=de&ajax=1&domain=**myrepositoryname**
- You can improve your search results by defining the following PHP constants:
  define('PARSED_AREA_CSSID', 'myDivContainerID');
  define('NUMBER_OF_PREVIEWCHARS', 200);
  define('SEARCH_LIMIT', 100);
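To query the eID endpoint from your own code, the query string described above can be assembled as in this minimal Python sketch. The helper name `build_search_url` and the base URL are hypothetical; the parameter names (eID, searchword, lang, ajax, domain) and the default values de and Marit are taken from this manual. The format of the response is not documented here, so the sketch only builds the URL.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, searchword: str, lang: str = "de",
                     domain: str = "Marit") -> str:
    """Build a marita eID search URL for the given search term and repository."""
    params = {
        "eID": "marita",
        "searchword": searchword,  # urlencode escapes spaces and umlauts
        "lang": lang,
        "ajax": 1,
        "domain": domain,          # repository name, see marita.domain
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical host; replace it with your own site.
print(build_search_url("http://www.example.org/", "mysearchterm",
                       lang="de", domain="myrepositoryname"))
# prints http://www.example.org/?eID=marita&searchword=mysearchterm&lang=de&ajax=1&domain=myrepositoryname
```

Building the query string with urlencode rather than by string concatenation keeps search terms with spaces or special characters valid.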
FAQ¶
- Curl is required
- Zend Lucene is required, but included
- Smarty is required, but included
Configuration¶
Reference¶
Reference (TypoScript constants).
marita.lang¶
Property
marita.lang
Data type
String
Description
Language to use. Select one of the languages defined in the following file: /html/typo3conf/ext/marita/cli/lang.php
Default
de
marita.domain¶
Property
marita.domain
Data type
String
Description
Used repository
Default
Marit
marita.searchfieldtext¶
Property
marita.searchfieldtext
Data type
String
Description
Button text
Default
Suchen
Record field explanation
RepositoryName¶
Property
RepositoryName
Data type
String
Description
Just use any name you wish
Example
myrepository
URL¶
Property
URL
Data type
String
Description
The point where the crawler starts to crawl
Example
Pattern¶
Property
Pattern
Data type
String
Description
The string that a crawled URL has to match. URLs that do not match are neither indexed nor followed.
Example
Marit.ag
Exeptions¶
Property
Exeptions
Data type
String
Description
Comma-separated strings that are exceptions to the pattern
Example
Blog.marit.ag
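How Pattern and Exeptions might work together can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the extension's actual matching code: it assumes plain, case-insensitive substring matching, and the function names `parse_exceptions` and `should_index` are hypothetical.

```python
def parse_exceptions(raw: str) -> list[str]:
    """Split the comma-separated Exeptions field into lowercase strings."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def should_index(url: str, pattern: str, exceptions: str = "") -> bool:
    """Return True if the URL matches the pattern and none of the exceptions.

    Assumption: matching is a case-insensitive substring test.
    """
    url_lc = url.lower()
    if pattern.lower() not in url_lc:
        return False  # URL does not match the pattern: neither indexed nor followed
    return not any(exc in url_lc for exc in parse_exceptions(exceptions))
```

With the example values from this section (pattern Marit.ag, exception Blog.marit.ag), a page on www.marit.ag would be indexed under these assumptions, while a page on blog.marit.ag would be skipped.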
Depth in levels¶
Property
Depth in levels
Data type
Int
Description
How many link levels the crawler follows when looking for more pages. I use it to avoid loops or infinite link structures (like calendar day views back to 1920 :-) ).
Increase this value to get more pages crawled (99 worked for me)!
Example
5
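The effect of a depth limit can be illustrated with a small breadth-first crawl over an in-memory link graph. This Python sketch is not the extension's crawler; the function name `crawl` and the `links` dictionary are hypothetical, and whether marita counts the start URL as level 0 is an assumption.

```python
from collections import deque


def crawl(start: str, links: dict[str, list[str]], max_depth: int) -> list[str]:
    """Return the pages reached from start by following at most max_depth link levels."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])  # (url, depth); the start page is level 0

    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in seen:  # skipping known pages prevents loops
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


# An endless "next day" chain, as in a calendar day view.
calendar = {"/": ["/cal/1"], "/cal/1": ["/cal/2"], "/cal/2": ["/cal/3"], "/cal/3": ["/cal/4"]}
print(crawl("/", calendar, max_depth=2))  # prints ['/', '/cal/1', '/cal/2']
```

A small depth cuts off such chains early, while a large value such as 99 lets the crawler reach deeply nested pages at the cost of a longer crawl.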
Spellcheck¶
Property
Spellcheck
Data type
Boolean
Description
Enable this to get a “did you mean” function. This function is still very much alpha and has not really been tested yet.
Example
Check!
Known problems¶
And of course Todos at the same time:
- The indexer works only with UTF-8 and XHTML (Transitional or Strict)
- Not all settings can be made via TypoScript
- Works only with TYPO3 version <4.2 (eID, CLI)
- If the link structure of your website is not clean, some pages will be shown multiple times
- Login restricted pages cannot be indexed at the moment (got an idea how? Please mail me!).
- Nofollow links should be excluded
- Frontend output is not templated yet.
- Word and PPT files cannot be indexed.