
EXT: robots.txt for multidomain

Created: 2010-02-18T17:33:18
Changed: 2014-04-27T11:58:51
Classification: dev_null_robots
Keywords: robots, robots.txt, multidomain
Author: Wolfgang Rotschek
Email: scotty@dev-null.at
Language: en, de


EXT: robots.txt for multidomain

Extension Key: dev_null_robots

Language: en, de

Keywords: robots, robots.txt, multidomain

Copyright 2011, Wolfgang Rotschek, <scotty@dev-null.at>

This document is published under the Open Content License

available from http://www.opencontent.org/opl.shtml

The content of this document is related to TYPO3

- a GNU/GPL CMS/Framework available from www.typo3.org

Table of Contents

Introduction
  What does Dev/Null Robots.txt do?
  Features
  Credits
Users manual
  Usage
Administration
  Cooluri
  RealURL
  .htaccess
Domain Record Configuration
  Crawler Config List
  Default Crawler Directives
Crawler-specific Configuration
Page Meta Tags
Configuration
  Reference
  Output Defaults for robots.txt (useDefaults)
Known problems
  Feature request & Bugs
To-Do list
ChangeLog

Introduction

What does Dev/Null Robots.txt do?

This extension generates domain-specific robots.txt files. Configuration is done in sys_domain records and via TypoScript.

Features

  • Preconfigured TS-Templates for robots.txt
  • Different robots.txt per domain
  • Supports RealURL auto-configuration

Credits

Automatic RealURL configuration where possible – Jacob Floyd <jacob@cognifire.com>

Users manual

  • Download and install the extension with the extension manager
  • Configure URL-rewriting for your robots.txt (as necessary)

Usage

This extension works nearly out of the box, especially if you use RealURL with automatic configuration. If you don't use that, only the URL-rewriting has to be configured. (See the Administration section for help configuring URL-rewriting.)

Assuming your website's root page has id=1, make sure your robots.txt is working with the following URL:

http://www.yourdomain.tld/index.php?id=1&type=1964

Next, configure URL-rewriting for each domain, as necessary. (See Administration.)

[Optional] Add additional robots.txt configuration in domain records.

(See Domain Record Configuration and Crawler-specific Configuration.)

When you're done, make sure that it works by visiting: http://www.yourdomain.tld/robots.txt

[Optional] Add any page-specific metadata. (See Page Meta Tags.)

Example

You can see a working robots.txt at: http://www.dev-null.at/robots.txt

Administration

The extension is pre-configured to output a robots.txt whenever page type 1964 is requested. If your templates already use type 1964, you can change the default value of typeNum in the Constant Editor. To get to the Constant Editor, go to the root of your website, edit the template, and choose “Constant Editor” from the drop-down menu. Just remember: if you change the default typeNum, you must also manually configure your URL-rewriting solution.
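If you prefer editing your TypoScript setup directly, the same value can be set via the config.devnullrobots.typeNum property documented in the Configuration reference below; a minimal sketch (1999 is only an example value):

# Change the page type that serves robots.txt (1999 is an example value only).
# Remember to adapt your URL-rewriting configuration to the new type number.
config.devnullrobots.typeNum = 1999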

Depending on your chosen URL-rewriting method, you may also have to configure rewriting the robots.txt file for each domain.

Check out the following sections for configuration instructions for Cooluri, RealURL and .htaccess.

Cooluri

If you use the extension Cooluri for URL-rewriting, you have to define the link for each domain that should serve a robots.txt.


RealURL

If you use RealURL's automatic configuration, you're in luck; RealURL auto-configuration is supported out of the box. You may have to delete your realurl_autoconf.php file in typo3conf to get RealURL to regenerate it.

However, you must manually update your RealURL configuration if: (a) you have customized your RealURL rewriting configuration; or (b) you have changed the default typeNum setting. To manually make RealURL serve your robots.txt, add the following block to your RealURL configuration (see the RealURL documentation for details). By default this block belongs in the 'fileName' part of the configuration array.

'index' => array(
    'robots.txt' => array(
        'keyValues' => array(
            'type' => '1964',
        ),
    ),
),

Remember: If you change your typeNum you have to change it here too!
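For orientation, here is a minimal sketch of how that block might sit inside a complete RealURL configuration. The '_DEFAULT' domain key and the surrounding array are assumptions; merge this into your existing setup rather than replacing it.

<?php
// Sketch only – merge into your existing RealURL configuration
// (for example in typo3conf/localconf.php); '_DEFAULT' is an assumed domain key.
$TYPO3_CONF_VARS['EXTCONF']['realurl']['_DEFAULT'] = array(
    'fileName' => array(
        'index' => array(
            'robots.txt' => array(
                'keyValues' => array(
                    // must match config.devnullrobots.typeNum
                    'type' => '1964',
                ),
            ),
        ),
    ),
);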

.htaccess

This method is not recommended. It does not work in conjunction with RealURL, and it has not been tested in other situations. You may have to create a rewrite rule for each domain.

# Serve robots.txt via dev_null_robots (add one rule block per domain)
RewriteCond  %{HTTP_HOST}  www.yourdomain.tld
RewriteRule  ^(robots\.txt)$ index.php?id=1&type=1964 [L,NC]

# Generic TYPO3 rewrite rule (must come after the robots.txt rule)
RewriteRule .* index.php [L]

### End: Rewrite stuff ###

Domain Record Configuration

Domain records get two new fields: “crawler config list” and “Default crawler directives”.


Crawler Config List

A list of crawler-specific configurations. These are inserted before the default configuration.

Default Crawler Directives

Default Disallow and Allow statements without a User-agent line. To disallow everything for the given domain, just enter:

Disallow: /

Crawler-specific Configuration

Default crawler configurations can be linked to domain records. Each crawler configuration has two fields:


Field: User-agent
Description: String which identifies the crawler.

Field: Crawler config
Description: Configuration for the crawler.
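A hypothetical example of how such a record could be filled in, assuming the Crawler config field takes ordinary robots.txt directive lines just like the Default Crawler Directives field:

User-agent:      Googlebot
Crawler config:  Disallow: /typo3temp
                 Disallow: /print/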

Page Meta Tags

Additionally or alternatively, you can add a robots meta tag to each page. You can set up the default behavior through the Constant Editor.


On the metadata tab of the page record you can restrict the robots settings for that page.


The restrictions from the TypoScript setup and from the page record are merged. This means a noindex in the setup is applied to all pages, even if the page's checkbox is unchecked.

Example:
<meta name="robots" content="noindex,nofollow" />
Important:

Currently there is no relation between the robots.txt file and the meta tag settings.

Configuration

You can configure the defaults for this extension by modifying the config.devnullrobots TypoScript object. You can override these defaults on a per-domain basis by modifying the appropriate fields in your sys_domain record(s).

Reference

useDefault

Property: useDefault
Data type: option
Description: Selects which default configuration from the TS template is used:

  off       no default configuration will be added
  allow     use config.devnullrobots.defaultRules.defaultAllow
  disallow  use config.devnullrobots.defaultRules.defaultDisallow

useDefault is ignored when a domain record provides its own configuration.

Default: allow

sitemapName

Property: sitemapName
Data type: string
Description: If set, a Sitemap: directive will be written at the end of the robots.txt.
Default: sitemap.xml

metaRobots

Property: metaRobots
Data type: bool
Description: If set, a robots meta tag will be written into the page's header.
Default: 0

metaDefault

Property: metaDefault
Data type: int
Description: Bitmask for the robots meta tag:

  0  index,follow
  1  noindex,follow
  2  index,nofollow
  3  noindex,nofollow

Default: 0

typeNum

Property: typeNum
Data type: int
Description: typeNum of the TypoScript page object that renders the robots.txt file.
Default: 1964

[tsref:config.devnullrobots]
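For reference, a minimal TypoScript setup sketch that sets all of the properties above (the values shown are examples only):

# Example setup for dev_null_robots (values are illustrative)
config.devnullrobots {
    # off | allow | disallow
    useDefault = allow
    # adds a "Sitemap: ..." directive at the end of robots.txt
    sitemapName = sitemap.xml
    # 1 = write a robots meta tag into the page header
    metaRobots = 1
    # bitmask: 0 = index,follow ... 3 = noindex,nofollow
    metaDefault = 0
    # page type that serves robots.txt
    typeNum = 1964
}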

Output Defaults for robots.txt (useDefaults)

Default 'allow' output

This is the default robots.txt output when useDefault = allow. If you would like to create your own defaults to be used across all of the sites in your TYPO3 installation, modify config.devnullrobots.defaultRules.defaultAllow in your TS template.

User-agent: *
Disallow: /fileadmin
Disallow: /t3lib
Disallow: /typo3conf
Disallow: /typo3temp
Allow:    /fileadmin/media

Default 'disallow' output

This is the default robots.txt output when useDefault = disallow. If you would like to create your own defaults to be used across all of the sites in your TYPO3 installation, modify config.devnullrobots.defaultRules.defaultDisallow in your TS template.

User-agent: *
Disallow: /
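A sketch of how such an override might look in your TS template, assuming the defaultRules properties accept the raw robots.txt lines as a multi-line TypoScript value (compare with the TS template shipped with the extension for the exact expected format):

# Assumed override of the shipped 'allow' rule set – verify the format
# against the extension's own TS template before using this.
config.devnullrobots.defaultRules.defaultAllow (
User-agent: *
Disallow: /typo3conf
Disallow: /typo3temp
)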

Known problems

The extension must be loaded after the extension sourceopt.

Feature request & Bugs

If you find a bug or a missing feature, please add it to the bug tracker of the extension on Forge.

The address is:

http://forge.typo3.org/projects/extension-dev_null_robots/issues

To post a bug or feature request you need an account on typo3.org.

You can see this extension in operation on my websites:

http://www.dev-null.at

http://www.letsdive.at

To-Do list

ChangeLog

2.0.0

  TYPO3 6.2 compatibility

1.4.0

  Added version dependency information
  Tested with TYPO3 4.5, 4.7, 6.0

1.3.0

  Added RealURL auto-configuration – thanks to Jacob Floyd

1.2.0

  Site root doesn't need to be on the topmost level within the page tree anymore

1.1.1

  Added German translation – thanks to Bernhard Eckl
  Added manual section for robots meta tags
  Reformatted manual

1.1.0

  Support for meta tags
  Support for sitemap directive in robots.txt

1.0.1

  Added documentation for RealURL

1.0.0

  Initial upload
