EXT: marita - A Zend Lucene based search indexer

Author: Kasper Skårhøj
Created: 2002-11-01T00:32:00
Changed by: Michael Fritz
Changed: 2010-02-05T20:09:40.900000000
Classification: marita
Keywords: indexer, search, lucene, marita, marit ag, michael fritz
Author: Michael Fritz
Email: michael.fritz marit.ag
Info 4:
Language: en

Extension Key: marita

Copyright 2010-20xx, Michael Fritz, <michael.fritz marit.ag>

This document is published under the Open Content License, available from http://www.opencontent.org/opl.shtml

The content of this document is related to TYPO3, a GNU/GPL CMS/Framework available from www.typo3.org

Table of Contents

  • Introduction
  • What does it do?
  • Screenshots
  • Users manual
  • FAQ
  • Administration
  • FAQ
  • Configuration
  • Reference
  • Known problems
  • ChangeLog

Introduction

What does it do?

This extension has been developed by TYPO3 Agentur Marit AG. It is an alternative approach to the search solution Solr that can be used without a JSP server and without any other additional technology.

  • This is a search engine for TYPO3 or other technologies, based on the PHP implementation Zend Lucene.
  • It crawls the visible pages of one or more defined websites: all content and PDF files that can be seen by the public.
  • Other, non-TYPO3 pages can also be indexed as long as they are delivered as XHTML (transitional or strict) and UTF-8.
  • Search result pages are weighted by the relevance of keywords.
  • Search requests are extremely fast and are delivered via an Ajax-like interface.

Screenshots

An example, although it is not built with TYPO3: http://www.ct-arzneimittel.de/

Screenshot: a record to set up a domain to crawl.

Users manual

  • Install the extension.
  • Set up the domain to crawl as a record somewhere in a sysfolder.
  • Set up a backend user named _cli_marita. It needs no special rights, and any password will do.
  • Test the crawler with: php /html/typo3/cli_dispatch.phpsh marita run (on Mittwald servers, use php_cli instead of php).

While the indexer runs, you can follow its progress in the following directories:

  • /html/typo3conf/ext/marita/cli/lucene/index (the Lucene index)
  • /html/typo3conf/ext/marita/cli/lucene/log (log files)

If you see a lot of similar pages in the log file, there is probably something wrong with your link structure. You should fix this, because the crawler works like any other search engine crawler, so duplicated content and URLs also hurt your SEO (a well-known case is the tt_news backPID issue, which is not recommended). The crawler tries to find similar pages and merge them, but it cannot do so as soon as the pages differ even slightly.

When the crawler has finished its job, the index folder is renamed from index.myrepository.new to index.myrepository. The old folder index.myrepository, if it exists, is deleted. If you stopped the crawler before it finished, you can rename the folder manually in order to get some test results.
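The folder swap described above can be sketched as follows. This is only an illustration in Python of the rename-and-delete step, not the extension's own PHP code; the function name `promote_index` and the temporary demo directory are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def promote_index(index_root: Path, repository: str) -> Path:
    """Move index.<repository>.new into place as index.<repository>.

    Hypothetical helper: it mirrors the behaviour described in the manual,
    it is not taken from the extension's source.
    """
    new_dir = index_root / f"index.{repository}.new"
    live_dir = index_root / f"index.{repository}"
    if not new_dir.is_dir():
        raise FileNotFoundError(f"no freshly built index at {new_dir}")
    if live_dir.exists():
        shutil.rmtree(live_dir)  # the old index folder, if existing, is deleted
    new_dir.rename(live_dir)     # the new index becomes the live one
    return live_dir


if __name__ == "__main__":
    # Demo in a throwaway directory instead of the real lucene/index folder.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "index.myrepository.new").mkdir()
        (root / "index.myrepository").mkdir()
        print(promote_index(root, "myrepository").name)
```

Doing the same by hand after an aborted crawl amounts to the single rename step, which is why a partially built index can still serve test results.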

  • Add the extension's TypoScript to your template.
  • Put the search form on your page.
  • Search results should now be displayed on your page.

FAQ

Can I change the layout of the search form? Yes, just override lib.marita, which is defined in /html/typo3conf/ext/marita/static/lib.marita/setup.txt

Administration

  • Set up a cron job so that your server crawls once a day. The crawler takes about 1 hour per 500 pages. Example crontab entry: 0 0 * * * php /html/typo3/cli_dispatch.phpsh marita run
  • The extension frontend uses jQuery, but only for a few selectors, so you could easily replace it with a different framework. The file is /html/typo3conf/ext/marita/res/js/jSearch.js
  • The template of the search frame uses Smarty. The templates can be found at /html/typo3conf/ext/marita/cli/view/tempaltes/
  • You can retrieve the index results with a different frontend function by querying the following eID feature: ?eID=marita&searchword=mysearchterm&lang=de&ajax=1&domain=myrepositoryname
  • If you are using multiple indices, you can select a different repository either by changing the constant marita.domain or by changing the domain parameter in the query string: ?eID=marita&searchword=mysearchterm&lang=de&ajax=1&domain=myrepositoryname
  • You can tune your search results by defining the following PHP constants:
      define('PARSED_AREA_CSSID', 'myDivContainerID');
      define('NUMBER_OF_PREVIEWCHARS', 200);
      define('SEARCH_LIMIT', 100);
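If you want to call the eID interface from your own client code, the query string shown above can be assembled programmatically. The following Python sketch only builds the URL; the base URL `http://www.example.com/index.php` and the function name `build_search_url` are assumptions for illustration, and the format of the response is not documented here, so the sketch does not try to parse it.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, searchword: str, domain: str, lang: str = "de") -> str:
    """Build a marita eID search URL from the parameters named in this manual."""
    params = {
        "eID": "marita",
        "searchword": searchword,  # urlencode escapes spaces and umlauts
        "lang": lang,
        "ajax": 1,
        "domain": domain,          # the repository (index) to query
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host; replace with the domain of your TYPO3 installation.
    print(build_search_url("http://www.example.com/index.php", "mysearchterm", "myrepositoryname"))
```

Switching to another index then only means passing a different value for `domain`.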

FAQ

  • Curl is required
  • Zend Lucene is required, but included
  • Smarty is required, but included

Configuration

Reference

Reference (TypoScript constants)

marita.lang

  • Data type: String
  • Description: Language to use. Select one of the languages defined in the following file: /html/typo3conf/ext/marita/cli/lang.php
  • Default: de

marita.domain

  • Data type: String
  • Description: Repository to use
  • Default: Marit

marita.searchfieldtext

  • Data type: String
  • Description: Button text
  • Default: Suchen

Record field explanation

RepositoryName

  • Data type: String
  • Description: Just use any name you wish.
  • Example: myrepository

URL

  • Data type: String
  • Description: The point where the crawler starts to crawl.

Pattern

  • Data type: String
  • Description: The string a crawled URL has to match. URLs that do not match are neither indexed nor followed.
  • Example: Marit.ag

Exceptions

  • Data type: String
  • Description: Comma-separated strings that make an exception to the pattern.
  • Example: Blog.marit.ag

Depth in levels

  • Data type: Int
  • Description: How many levels of links are followed to look for more pages. I use it to avoid loops or infinite link structures (like calendar day views going back to 1920). Increase this value to get more pages crawled (99 worked for me).
  • Example: 5

Spellcheck

  • Data type: Boolean
  • Description: Check this box to get a "did you mean" function. This function is still very much alpha and not really tested yet.
  • Example: checked
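How Pattern, Exceptions and Depth in levels might interact can be illustrated with a small Python sketch. It assumes plain substring matching, assumes that an exception excludes a URL even though it matches the pattern, and uses an in-memory `links` dictionary in place of real page fetching; none of this is taken from the extension's code, and the names `is_allowed` and `crawl` are hypothetical.

```python
from collections import deque


def is_allowed(url: str, pattern: str, exceptions: str = "") -> bool:
    """Assumed rule: the URL contains the pattern and none of the exceptions."""
    if pattern not in url:
        return False
    blocked = [part.strip() for part in exceptions.split(",") if part.strip()]
    return not any(part in url for part in blocked)


def crawl(start: str, links: dict, pattern: str, exceptions: str, max_depth: int) -> list:
    """Breadth-first walk that stops following links at max_depth levels.

    The start URL is always visited; only discovered links are filtered.
    """
    seen = {start}
    visited = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        for target in links.get(url, []):
            if target not in seen and is_allowed(target, pattern, exceptions):
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


if __name__ == "__main__":
    # Toy link graph standing in for a real website.
    site = {
        "http://www.marit.ag/": ["http://www.marit.ag/a", "http://blog.marit.ag/x"],
        "http://www.marit.ag/a": ["http://www.marit.ag/b"],
    }
    print(crawl("http://www.marit.ag/", site, "marit.ag", "blog.marit.ag", 5))
```

Because already seen URLs are skipped, the depth limit mainly protects against link structures that keep generating new URLs, such as the calendar views mentioned above.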

Known problems

These are also the to-dos:

  • The indexer works only with UTF-8 and XHTML (transitional or strict).
  • Not all settings can be made via TypoScript.
  • Works only with TYPO3 version <4.2 (eID, CLI).
  • If the link structure of your website is not clean, some pages will be shown multiple times.
  • Login-restricted pages cannot be indexed at the moment (got an idea how? Please mail me!).
  • Nofollow links should be excluded.
  • Frontend output is not templated yet.
  • Word and PPT files cannot be indexed.
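Regarding pages that show up multiple times: one common way to reduce duplicates, on the website side or in a custom client, is to reduce each URL to a canonical form before comparing. The Python sketch below is an assumption about how this could be done, not a description of what the crawler does; the parameter name `backPID` is used only as an example of a query parameter that changes the URL without changing the page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url: str, ignored_params: frozenset) -> str:
    """Drop ignored query parameters and the fragment, and sort the rest."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored_params
    ]
    kept.sort()  # parameter order should not create a "new" page
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    ignored = frozenset({"backPID"})  # illustrative parameter name
    a = canonical_url("http://www.example.com/news.php?id=5&backPID=3#top", ignored)
    b = canonical_url("http://WWW.Example.com/news.php?backPID=9&id=5", ignored)
    print(a == b)
```

Two URLs that map to the same canonical form can then be treated as one page.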

ChangeLog

  • Initial version
  • Some improvements
  • Manual added
