.. You may want to use the usual include line. Uncomment and adjust the path. 
.. include:: ../Includes.txt

.. role:: underline


================================================
EXT: marita - A Zend Lucene based search indexer
================================================

:Author:
      Kasper Skårhøj

:Created:
      2002-11-01T00:32:00

:Changed by:
      Michael Fritz

:Changed:
      2010-02-05T20:09:40.900000000

:Classification:
      marita

:Keywords:
      indexer, search, lucene, marita, marit ag, michael fritz

:Author:
      Michael Fritz

:Email:
      michael.fritz marit.ag

:Info 4:


:Language:
      en




.. _EXT-marita-A-Zend-Lucene-based-search-indexer:

EXT: marita - A Zend Lucene based search indexer
================================================

Extension Key: marita

Language: en

Keywords: indexer, search, lucene, marita, marit ag, michael fritz

Copyright 2010-20xx, Michael Fritz, <michael.fritz marit.ag>

This document is published under the Open Content License, available
from http://www.opencontent.org/opl.shtml

The content of this document is related to TYPO3, a GNU/GPL
CMS/Framework available from www.typo3.org

.. _Table-of-Contents:

Table of Contents
-----------------

- EXT: marita - A Zend Lucene based search indexer

  - Introduction

    - What does it do?

    - Screenshots

  - Users manual

    - FAQ

  - Administration

    - FAQ

  - Configuration

    - Reference

  - Known problems

  - ChangeLog


.. _Introduction:

Introduction
------------


.. _What-does-it-do:

What does it do?
^^^^^^^^^^^^^^^^

This extension was developed by the TYPO3 agency Marit AG. It is an
alternative to the Solr search solution that runs without a JSP server
or any other additional technology.

- It is a search engine for TYPO3 or other technologies, based on the
  PHP library Zend Lucene.

- It crawls the publicly visible pages of one or more defined websites,
  including all content and PDF files that the public can see.

- Non-TYPO3 pages can be indexed as well, as long as they are XHTML
  (transitional or strict) and UTF-8 encoded.

- Search results are weighted by keyword relevance.

- Search requests are extremely fast, and results are delivered through
  an Ajax-like interface.


.. _Screenshots:

Screenshots
^^^^^^^^^^^

An example, although it does not run on TYPO3:
`http://www.ct-arzneimittel.de/ <http://www.ct-arzneimittel.de/>`_

|img-3|

|img-4| A record to set up a domain to crawl.


.. _Users-manual:

Users manual
------------

- Install the extension.

- Set up the domain to crawl as a record somewhere in a sysfolder.

- Set up a backend user named ``_cli_marita``. No special rights are
  required, and any password will do.

  |img-5|

- Test the crawler by running
  ``php /html/typo3/cli_dispatch.phpsh marita run`` (on Mittwald
  servers, use ``php_cli`` instead of ``php``).

  |img-6|

- While the crawler runs, you can follow its progress in the following
  directories:

  - ``/html/typo3conf/ext/marita/cli/lucene/index`` (the Lucene index)

  - ``/html/typo3conf/ext/marita/cli/lucene/log`` (log files)

- If you see many similar pages in the log file, there is probably
  something wrong with your link structure. You should fix this, because
  the crawler works like any other search engine crawler, so duplicate
  content and duplicate URLs also hurt your SEO. (There is a well-known
  tt\_news issue with backPID, which is not recommended.) The crawler
  tries to detect similar pages and merge them, but it cannot do so
  when the pages differ slightly.

- After the crawler has finished, the index folder is renamed from
  ``index.myrepository.new`` to ``index.myrepository``. An existing old
  folder ``index.myrepository`` is deleted. If you stopped the crawler
  before it finished, you can rename the folder manually to get some
  test results.

- Add the extension's TypoScript to your template.

- Put the search form on your page.

- The page should now return search results.

|img-7|
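
If the crawler was stopped early, the manual rename described above can
be sketched as a small shell function. This is only a sketch under
assumptions: the function name ``promote_index`` is hypothetical, and it
assumes that the ``index.<repository>.new`` and ``index.<repository>``
folders sit side by side in the directory passed as the first argument.

.. code-block:: shell

   #!/bin/sh
   # Hypothetical helper: promote a partially built index so it can be searched.
   # Assumption: both folders live side by side inside the directory "$1".
   promote_index() {
       base_dir=$1
       repo=$2
       new_dir="$base_dir/index.$repo.new"
       live_dir="$base_dir/index.$repo"

       # Nothing to promote if the crawler has not written a new index.
       if [ ! -d "$new_dir" ]; then
           echo "no unfinished index found: $new_dir" >&2
           return 1
       fi

       # Same order as the crawler uses on success: delete the old index,
       # then rename the new one.
       rm -rf -- "$live_dir"
       mv -- "$new_dir" "$live_dir"
   }

   # Usage (the location is assumed from the directories listed above):
   # promote_index /html/typo3conf/ext/marita/cli/lucene/index myrepository

The old index is deleted before the rename, so run this only when no
crawl is in progress.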


.. _FAQ:

FAQ
^^^

Can I change the layout of the search form? Yes, just override
``lib.marita``. Its default definition can be found in
``/html/typo3conf/ext/marita/static/lib.marita/setup.txt``.


.. _Administration:

Administration
--------------

- Set up a cron job so that your server crawls once a day. The crawler
  takes about 1 hour per 500 pages. Example crontab entry:
  ``0 0 * * * php /html/typo3/cli_dispatch.phpsh marita run``

- The extension frontend uses jQuery, but only for a few selectors, so
  you can easily replace it with a different framework. The file is
  ``/html/typo3conf/ext/marita/res/js/jSearch.js``.

- The template of the search frame uses Smarty. The templates can be
  found in ``/html/typo3conf/ext/marita/cli/view/templates/``.

- You can retrieve the index results from your own frontend code by
  querying the following eID interface:
  ``?eID=marita&searchword=mysearchterm&lang=de&ajax=1&domain=myrepositoryname``

- If you use multiple indices, you can select a different repository
  either by changing the constant ``marita.domain`` or by changing the
  ``domain`` parameter in the query string:
  ``?eID=marita&searchword=mysearchterm&lang=de&ajax=1&domain=myrepositoryname``

- You can improve your search results by defining the following PHP
  constants:

  - ``define('PARSED_AREA_CSSID', 'myDivContainerID');``

  - ``define('NUMBER_OF_PREVIEWCHARS', 200);``

  - ``define('SEARCH_LIMIT', 100);``
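
The eID query described above can be assembled with a small shell
function. This is a minimal sketch: the function name
``marita_search_url`` and the base URL ``http://www.example.org/index.php``
are placeholders that do not come from the extension. Only the query
parameters are taken from this manual.

.. code-block:: shell

   #!/bin/sh
   # Hypothetical helper: build the eID query URL for a search term.
   # Only the query parameters come from the manual; the base URL is an
   # assumption and has to be replaced with your own site.
   marita_search_url() {
       base_url=$1   # e.g. http://www.example.org/index.php (placeholder)
       term=$2       # search word; must already be URL-encoded
       lang=$3       # language key, e.g. de
       repo=$4       # repository name, i.e. the value of marita.domain
       printf '%s?eID=marita&searchword=%s&lang=%s&ajax=1&domain=%s\n' \
           "$base_url" "$term" "$lang" "$repo"
   }

   # Usage: fetch the raw result with curl (not executed here).
   # curl -s "$(marita_search_url http://www.example.org/index.php typo3 de myrepository)"

Passing a different repository name as the last argument queries a
different index.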


.. _Administration-FAQ:

FAQ
^^^

- cURL is required.

- Zend Lucene is required; it is included with the extension.

- Smarty is required; it is included with the extension.
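
A quick way to check the first requirement before the first crawl is
sketched below. The manual does not say whether "Curl" means the
command-line tool or PHP's curl extension; this sketch assumes the
extension. The helper name ``module_listed`` is hypothetical.

.. code-block:: shell

   #!/bin/sh
   # Hypothetical helper: succeed if the module named "$1" appears in a
   # module list read from standard input (one module name per line).
   module_listed() {
       grep -qix "$1"
   }

   # Usage: `php -m` prints the loaded PHP modules, one per line.
   # Assumption: the requirement refers to PHP's curl extension.
   # if php -m | module_listed curl; then
   #     echo "PHP curl module available"
   # else
   #     echo "PHP curl module missing" >&2
   # fi
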


.. _Configuration:

Configuration
-------------


.. _Reference:

Reference
^^^^^^^^^

Reference (TypoScript constants).

.. ### BEGIN~OF~TABLE ###


.. _marita-lang:

marita.lang
"""""""""""

.. container:: table-row

   Property
         marita.lang
   
   Data type
         String
   
   Description
         Language to use. Select one of the languages defined in the file
         ``/html/typo3conf/ext/marita/cli/lang.php``.
   
   Default
         de


.. _marita-domain:

marita.domain
"""""""""""""

.. container:: table-row

   Property
         marita.domain
   
   Data type
         String
   
   Description
         Name of the repository to search
   
   Default
         Marit


.. _marita-searchfieldtext:

marita.searchfieldtext
""""""""""""""""""""""

.. container:: table-row

   Property
         marita.searchfieldtext
   
   Data type
         String
   
   Description
         Button text
   
   Default
         Suchen


.. ###### END~OF~TABLE ######

Record field explanation

.. ### BEGIN~OF~TABLE ###


.. _RepositoryName:

RepositoryName
""""""""""""""

.. container:: table-row

   Property
         RepositoryName
   
   Data type
         String
   
   Description
         Just use any name you wish
   
   Example
         myrepository


.. _URL:

URL
"""

.. container:: table-row

   Property
         URL
   
   Data type
         String
   
   Description
         The point where the crawler starts to crawl
   
   Example
         http://www.marit.ag


.. _Pattern:

Pattern
"""""""

.. container:: table-row

   Property
         Pattern
   
   Data type
         String
   
   Description
         The string a crawled URL has to match. URLs that do not match are
         neither indexed nor followed.
   
   Example
         marit.ag


.. _Exeptions:

Exceptions
""""""""""

.. container:: table-row

   Property
         Exceptions
   
   Data type
         String
   
   Description
         Comma-separated strings that make an exception to the pattern
   
   Example
         blog.marit.ag


.. _Depth-in-levels:

Depth in levels
"""""""""""""""

.. container:: table-row

   Property
         Depth in levels
   
   Data type
         Int
   
   Description
         How many link levels the crawler follows to look for more pages.
         This avoids loops and infinite link structures (such as calendar
         day views going back to 1920).
         
         Increase this value to get more pages crawled (99 worked for the
         author).
   
   Example
         5


.. _Spellcheck:

Spellcheck
""""""""""

.. container:: table-row

   Property
         Spellcheck
   
   Data type
         Boolean
   
   Description
         Check this box to get a “did you mean” function. This function is
         still in an alpha state and has not really been tested yet.
   
   Example
         Checked


.. ###### END~OF~TABLE ######
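
How the Pattern and Exceptions fields work together can be illustrated
with a short shell sketch. It is not the extension's actual code. The
function name ``should_crawl`` is hypothetical, and the sketch assumes
that both fields are compared as plain substrings of the URL, which the
field descriptions suggest but do not state.

.. code-block:: shell

   #!/bin/sh
   # Hypothetical illustration of the Pattern and Exceptions fields.
   # Assumption: both are matched as plain substrings of the URL.
   should_crawl() {
       url=$1
       pattern=$2
       rest=$3   # comma-separated exception strings, may be empty

       # URLs that do not contain the pattern are neither indexed nor followed.
       case $url in
           *"$pattern"*) ;;
           *) return 1 ;;
       esac

       # Walk through the comma-separated exceptions one by one.
       while [ -n "$rest" ]; do
           exception=${rest%%,*}
           case $rest in
               *,*) rest=${rest#*,} ;;
               *) rest= ;;
           esac
           [ -n "$exception" ] || continue
           # A URL containing an exception string is skipped despite matching.
           case $url in
               *"$exception"*) return 1 ;;
           esac
       done
       return 0
   }

   # Usage with the example values from the tables above:
   # should_crawl "http://www.marit.ag/" "marit.ag" "blog.marit.ag" && echo crawl
   # should_crawl "http://blog.marit.ag/" "marit.ag" "blog.marit.ag" || echo skip
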


.. _Known-problems:

Known problems
--------------

These are also the current to-dos:

- The indexer works only with UTF-8 and XHTML (transitional or strict).

- Not all settings can be made via TypoScript.

- Works only with TYPO3 Version <4.2 (eID, CLI)

- If the link structure of your website is not clean, some pages will be
  shown multiple times.

- Login-restricted pages cannot be indexed at the moment (got an idea
  how? Please mail me!).

- Nofollow links should be excluded.

- Frontend output is not templated yet.

- Word and PPT files cannot be indexed.


.. _ChangeLog:

ChangeLog
---------

- Initial version

- Some improvements

- Manual added



.. ######CUTTER_MARK_IMAGES######

.. |img-1| image:: img-1.png
.. :align: left

.. |img-2| image:: img-2.png
.. :border: 0
.. :height: 21
.. :hspace: 9
.. :id: Grafik2
.. :name: Grafik2
.. :width: 87

.. |img-3| image:: img-3.png
.. :align: left
.. :border: 0
.. :height: 279
.. :id: Grafik1
.. :name: Grafik1
.. :width: 454

.. |img-4| image:: img-4.png
.. :align: left
.. :border: 0
.. :height: 197
.. :id: Grafik3
.. :name: Grafik3
.. :width: 269

.. |img-5| image:: img-5.png
.. :align: left
.. :border: 0
.. :height: 373
.. :id: Grafik5
.. :name: Grafik5
.. :width: 655

.. |img-6| image:: img-6.png
.. :align: left
.. :border: 0
.. :height: 61
.. :id: Grafik6
.. :name: Grafik6
.. :width: 504

.. |img-7| image:: img-7.png
.. :align: left
.. :border: 0
.. :height: 234
.. :id: Grafik7
.. :name: Grafik7
.. :width: 420