Apache Tika for TYPO3
- Classification: tika
- Version: 11.0.1
- Language: en
- Description: Apache Tika for TYPO3
- Keywords: apache, tika, meta, data, DAM, files, FAL, solr, server, language, content, detection, extraction
- Copyright: since 2009
- Author: Ingo Renner
- License: This document is published under the Open Content License available from http://www.opencontent.org/opl.shtml
- Rendered: 2024-05-03 16:11
The content of this document is related to TYPO3, a GNU/GPL CMS/Framework available from typo3.org.
What does it do?
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Tika can detect about 1200 file formats and can read about half of them. These include the most common ones: HTML, XML (including RSS and Atom feeds), Microsoft Office (binary formats and OOXML), OpenDocument (OpenOffice.org), Apple iWork, PDF, EPUB, RTF, compressed formats such as ZIP, audio formats including MP3, Flash Video (FLV), image formats including JPEG and TIFF, the mbox mailbox format, and many more.
Apache Tika for TYPO3 provides three services to retrieve information from files:
Text extraction
Language detection of file contents
Metadata extraction
All three services can be used with FAL.
It is recommended to use Apache Tika version 1.11 or higher.
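As an illustration of the three services, the Tika App command line offers one output option for each of them. The sketch below wraps these options in shell functions. The jar location and the variable names `TIKA_JAR` and `JAVA_BIN` are assumptions made for this example; they are not settings of the extension, and the extension may invoke Tika differently.

```shell
# Assumed locations; adjust both to your installation.
TIKA_JAR="${TIKA_JAR:-/opt/tika/tika-app.jar}"
JAVA_BIN="${JAVA_BIN:-java}"

# Text extraction: print the plain text content of a file.
tika_text() { "$JAVA_BIN" -jar "$TIKA_JAR" --text "$1"; }

# Language detection: print the language detected in the file's content.
tika_language() { "$JAVA_BIN" -jar "$TIKA_JAR" --language "$1"; }

# Metadata extraction: print the file's metadata.
tika_metadata() { "$JAVA_BIN" -jar "$TIKA_JAR" --metadata "$1"; }
```

For example, `tika_text /path/to/document.pdf` would print the text of that document, provided Java and the jar are available at the assumed locations.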
Configuration
All settings for the extension can be made in the TYPO3 Extension Manager. Select the service you would like to use: Tika App, Tika Server, or Solr Server. Then configure the settings for that service on the corresponding settings tab.
When done, check the TYPO3 system status report to validate your settings.
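When Tika Server is the selected service, it can help to try the server by hand before entering its host and port in the settings. The following is a minimal sketch using curl against the Tika Server HTTP endpoints. It assumes a server on localhost listening on Tika's default port 9998; `TIKA_SERVER_URL` and the function names are chosen for this example only.

```shell
# Assumed server address; 9998 is the Tika Server default port.
TIKA_SERVER_URL="${TIKA_SERVER_URL:-http://localhost:9998}"

# Text extraction: PUT the file to /tika and ask for plain text.
server_text() { curl -sS -T "$1" -H "Accept: text/plain" "$TIKA_SERVER_URL/tika"; }

# Language detection: PUT the file to /language/stream.
server_language() { curl -sS -T "$1" "$TIKA_SERVER_URL/language/stream"; }

# Metadata extraction: PUT the file to /meta and ask for JSON.
server_metadata() { curl -sS -T "$1" -H "Accept: application/json" "$TIKA_SERVER_URL/meta"; }
```

If `server_text /path/to/document.pdf` returns the document's text, the server address is a reasonable candidate for the extension settings.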
Configuration of Tika App
General information about how to configure the Tika App can be found in the official documentation.
To exclude certain MIME types from being processed by Tika, create the file /etc/tika/tika-config.xml with this content:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/zip</mime-exclude>
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/zip</mime>
    </parser>
  </parsers>
</properties>
This tells Tika to exclude ZIP files from DefaultParser and to handle them with EmptyParser instead, which does essentially nothing.
Then add one line to /etc/security/pam_env.conf:
TIKA_CONFIG DEFAULT="/etc/tika/tika-config.xml"
This sets a global environment variable that tells Tika where to look for its configuration file.
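Because pam_env applies its settings when a session is opened, the variable only becomes visible after a new login. The sketch below is one way to verify the result from a shell; the function name `check_tika_config` is made up for this example.

```shell
# Succeeds if TIKA_CONFIG is set and points to a readable file.
check_tika_config() {
    if [ -z "${TIKA_CONFIG:-}" ]; then
        echo "TIKA_CONFIG is not set" >&2
        return 1
    fi
    if [ ! -r "$TIKA_CONFIG" ]; then
        echo "cannot read $TIKA_CONFIG" >&2
        return 1
    fi
    echo "TIKA_CONFIG=$TIKA_CONFIG"
}
```

Run `check_tika_config` in a fresh login session. As an alternative to the environment variable, Tika App also accepts the file explicitly through its `--config=<file>` option.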
Getting Help
First, check the TYPO3 system status report for errors reported by the extension; they appear there as reported by Apache Tika. When you use Tika App or Tika Server, the extension checks whether Java is installed. It also checks your configuration: whether the configured paths for Tika App and Tika Server exist, and whether Tika Server or the Solr server can be reached, depending on which one you use.
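When the status report is not at hand, roughly the same checks can be made from a shell. This is only an approximation of what the report verifies, not the extension's own code. The jar path and server address are placeholders, and all function names are hypothetical.

```shell
# Placeholder values; replace them with the ones from your extension settings.
JAVA_BIN="${JAVA_BIN:-java}"
TIKA_JAR="${TIKA_JAR:-/opt/tika/tika-app.jar}"
TIKA_SERVER_URL="${TIKA_SERVER_URL:-http://localhost:9998}"

# Is a Java runtime available on the PATH?
check_java() { command -v "$JAVA_BIN" >/dev/null 2>&1; }

# Does the configured Tika jar exist and is it readable?
check_jar() { [ -r "$TIKA_JAR" ]; }

# Does Tika Server answer? Its /version endpoint returns a version string.
check_server() { curl -fsS "$TIKA_SERVER_URL/version" >/dev/null 2>&1; }

# Run all checks and report each failure on its own line.
run_checks() {
    check_java   || echo "Java not found: $JAVA_BIN"
    check_jar    || echo "Tika jar not readable: $TIKA_JAR"
    check_server || echo "Tika Server not reachable: $TIKA_SERVER_URL"
}
```

Calling `run_checks` prints one line per failed check and nothing when all three pass. Only the checks that match your chosen service are relevant.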
If you run into any issues setting up EXT:tika, don't hesitate to ask for help on the TYPO3 Solr Slack channel.