
textLang: TextCat

Author: Kasper Skårhøj
Created: 2002-11-01T00:32:00
Changed: 2005-07-20T14:36:12
Author: René Fritz
Email: r.fritz@colorcube.de

textLang: Lang guess

Extension Key: cc_langguess

Copyright 2003-2005, René Fritz, <r.fritz@colorcube.de>

This document is published under the Open Content License available from http://www.opencontent.org/opl.shtml

The content of this document is related to TYPO3, a GNU/GPL CMS/Framework available from www.typo3.com

Table of Contents

textLang: Lang guess

Introduction

Users manual

Introduction

This extension provides a service of the type 'textLang' that guesses the language of a given text snippet. The service uses a Perl script that can detect around 70 languages. The script is included in this extension and does not have to be installed separately. Perl itself must be installed for the service to work.

This service type is used by the DAM extension.

Unlike cc_textcat, this service works with different text encodings (charsets).

The Perl script

The Perl script uses a package developed by Maciej Ceglowski. The package expects the text content in UTF-8 encoding. Other solutions often expect the encoding most commonly used for each language.

http://search.cpan.org/~mceglows/

http://www.idlewords.com/lang/ident.pl?text=

The author writes: “I've been a big fan of TextCat, and wanted to see what happened if I combined the same algorithm for n-gram based identification with some intelligence about Unicode. The result is a Unicode-friendly language identifier that makes some initial guesses based on script block. It relies on proper UTF-8 input to be happy.”
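To illustrate what "proper UTF-8 input" means, here is a minimal sketch that converts an ISO-8859-1 string to UTF-8 with PHP's iconv() function. The helper name toUtf8() is hypothetical and not part of this extension, and the extension may perform its own conversion differently.

```php
<?php
// Hypothetical helper (not part of the extension):
// converts $text from $sourceCharset to UTF-8.
function toUtf8($text, $sourceCharset) {
    $converted = iconv($sourceCharset, 'UTF-8', $text);
    // iconv() returns false when the conversion fails.
    return ($converted === false) ? '' : $converted;
}

$latin1 = "Gr\xFC\xDFe";                // "Grüße" in ISO-8859-1 (5 bytes)
$utf8 = toUtf8($latin1, 'ISO-8859-1');  // the same text in UTF-8 (7 bytes)
echo strlen($latin1) . ' -> ' . strlen($utf8) . " bytes\n";
```

The two umlaut characters each take one byte in ISO-8859-1 and two bytes in UTF-8, which is why a byte-oriented language identifier needs to know the encoding of its input.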

Other resources on language detection:

http://www.let.rug.nl/~vannoord/TextCat/

Users manual

The service can be used in your own extension like this:

$textExcerpt = 'This is a sample text in the English language';
$lang_ISO_code = '';
// Proceed only if a service of the type 'textLang' is available
if (is_object($serviceObj = t3lib_div::makeInstanceService('textLang'))) {
    // Charset of $textExcerpt
    $conf['encoding'] = 'utf-8';
    $serviceObj->process($textExcerpt, '', $conf);
    $lang_ISO_code = $serviceObj->getOutput();
    $serviceObj->__destruct();
    unset($serviceObj);
}
$content = 'The guessed language is: '.$lang_ISO_code;

The charset of the text should be passed in the option 'encoding'. Otherwise, the value of $TYPO3_CONF_VARS['BE']['forceCharset'] is used.
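The 'encoding' option matters most when the text is not UTF-8. The following sketch wraps the service call in a small function and passes an ISO-8859-1 string. The function name guessLanguage() is hypothetical and not part of the extension. The class_exists() guard is an addition so that the snippet also runs outside of TYPO3, where it simply returns an empty string.

```php
<?php
// Hypothetical wrapper (not part of the extension): returns the guessed
// language code, or '' when no 'textLang' service is available.
function guessLanguage($text, $encoding = '') {
    // Outside of TYPO3 the t3lib_div class does not exist.
    if (!class_exists('t3lib_div')) {
        return '';
    }
    $serviceObj = t3lib_div::makeInstanceService('textLang');
    if (!is_object($serviceObj)) {
        return '';
    }
    $conf = array();
    if ($encoding !== '') {
        // Set the option only when the charset is known; otherwise the
        // service falls back to $TYPO3_CONF_VARS['BE']['forceCharset'].
        $conf['encoding'] = $encoding;
    }
    $serviceObj->process($text, '', $conf);
    $lang = $serviceObj->getOutput();
    $serviceObj->__destruct();
    unset($serviceObj);
    return $lang;
}

// "\xFC" is the letter "ü" in ISO-8859-1, so the charset is passed explicitly.
$latin1Text = "Dies ist ein Beispieltext \xFCber Sprachen";
echo 'The guessed language is: ' . guessLanguage($latin1Text, 'iso-8859-1') . "\n";
```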

If your Perl installation cannot be found, add its path to the following variable in localconf.php:

// String, comma separated list:
// list of absolute paths where external programs should be searched for
$TYPO3_CONF_VARS['SYS']['binPath'] = '/some/special/path/to/your/binaries/';
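Because the value is a comma separated list, more than one directory can be given. The paths below are hypothetical placeholders, not defaults of TYPO3 or of this extension.

```php
<?php
// Hypothetical paths; replace them with the directories on your server.
$TYPO3_CONF_VARS['SYS']['binPath'] = '/usr/local/bin/,/opt/perl/bin/';
```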

FAQ
Q: The service does not seem to work. What can be the reason?

A: If the service does not work, possible reasons are:

  • Perl is not installed
  • Perl is installed in a path where it cannot be found
  • Your web server or PHP is configured not to allow executing scripts
  • Your web server or PHP is configured to allow executing scripts only in certain directories
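The reasons above can be narrowed down with a small standalone check. This is a sketch only: the function name checkPerlAvailability() is hypothetical, and the check calls perl directly through PHP's exec() rather than going through the extension or the TYPO3 binPath setting.

```php
<?php
// Hypothetical helper (not part of the extension): returns a list of
// reasons why PHP might be unable to run Perl; an empty list means none found.
function checkPerlAvailability($perlBinary = 'perl') {
    $problems = array();

    // PHP can forbid exec() through the disable_functions setting.
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    if (!function_exists('exec') || in_array('exec', $disabled)) {
        $problems[] = 'exec() is disabled in the PHP configuration';
        return $problems;
    }

    // Run "perl -v"; a non-zero exit code means the binary was not found
    // or could not be executed.
    $output = array();
    $exitCode = 1;
    exec(escapeshellarg($perlBinary) . ' -v 2>&1', $output, $exitCode);
    if ($exitCode !== 0) {
        $problems[] = 'Could not execute "' . $perlBinary . '" (exit code ' . $exitCode . ')';
    }
    return $problems;
}

$problems = checkPerlAvailability();
echo count($problems) ? implode("\n", $problems) . "\n" : "Perl appears to be usable from PHP.\n";
```

If the check reports that perl cannot be executed although it is installed, passing the absolute path of the binary to the function shows whether only the search path is the problem.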
