
EXT: news feeder

Author:Kasper Skårhøj
Created:2002-11-01T00:32:00
Changed:2014-11-05T10:14:25.790000000
Author:Alex Tuveri, University of Udine
Email:at@uniud.it
Info 3:http://www.luxaeterna.it
Info 4:

Extension Key: ttnews_feeder

Copyright 2000-2014, Alex Tuveri, University of Udine, <at@uniud.it> current version: 3.0.1 BETA

This document is published under the Open Content License, available from http://www.opencontent.org/opl.shtml

The content of this document is related to TYPO3, a GNU/GPL CMS/Framework available from www.typo3.com

Table of Contents

EXT: news feeder
  Introduction
    What does it do?
    Screenshots
    Extension tested on...
    Stable, unstable or beta?
  User manual
    News approval
    News Statistics
    Manual Check
    Site/Search Engine Test
    Delete news
    Clean Database
    Show Configuration
    Load site definitions
    FAQ
  Administration
    Installation notes
    Configuration example
    FAQ
  Configuration
    How to define a new engine/site
    Titles excluded, accredited and refused sites
    How to define and use keywords
    Test mode
    Production mode
    CRON mode
    Notes about the images
    FAQ
  Reference
  To Do
    Known problems
    To-Do list
  Changelog

Introduction

What does it do?

If you want to fetch news from Google, AltaVista or Excite, this extension might fit your needs. You can also check individual sites (not only search engines), parse their pages and retrieve the news you need. This extension is not an RSS reader for search engine news; for that purpose you can use another extension downloadable from typo3.org.

The advantage of ttnews_feeder:

The extension is flexible: its main purpose is to get fresh news from search engines or single sites (dynamic or static), manually or through CRON. The aim is a simple system to populate your TYPO3 site and offer more interesting content to your visitors.

With this extension you can:

  • Fetch news from search engines (Google, Excite, etc.). You can define several parameters: keywords to search, keywords to exclude, how many news items to fetch, etc.
  • Fetch news from (virtually any) static or dynamic site that does not export its news via RSS

Among other things you can define:

  • one or more sysfolders to store the news (each with its own configuration)
  • one or more sysfolders to store your keywords and search parameters
  • keywords to search for on the requested engine, and keywords to exclude
  • the relation between each keyword and the desired site/engine
  • a category for each keyword: the news is then associated with the news categories and published on the site in different ways according to your needs
  • image support: images are downloaded and stored on your server, resized and related to the fetched news
  • titles (or parts of titles) to exclude
  • undesired sites to exclude
  • accredited sites: news from a recognized accredited site is automatically published on the FE
  • run mode: test mode or production mode for each site, with CRON mode, MANUAL CHECK or CRON+MANUAL CHECK
  • CRON mode keeps your DB clean for internal/external news without any operator intervention
  • a full report via email to the administrator in CRON mode
  • a partial report for the person responsible for the news

Screenshots

Manual check: as you can see, some records were accepted automatically and published, while others are waiting for approval. Photos and images are retrieved and stored on your server.

img-1

When you click on Manual check, ttnews_feeder connects to Google and the other engines or static sites previously defined and fetches the news according to the given parameters. An icon shows the record status. News Feeder checks for duplicate records and marks their status as refused.

You can run the test mode to simulate the production mode; this is a convenient way to test one or more sites and switch them to production mode when all is OK.

WARNING! Don't press the button “Run manual check” twice! Once it is pressed, some browsers (such as MSIE 7+) seem to do nothing. Just wait for the results.

News approval (TYPO3 4.0.2+): three options: suspend, delete, approve.

img-2

Test mode: you can individually select the sites you need to test, or invert the selection. Hidden sites will not be considered.

img-3

Load sites definition: configuring the most common search engines is very easy. Simply select what you want and click the button, and you are ready to run. Define one or more keywords and you can fetch the news.

img-4

img-4

Extension tested on...

This extension was tested successfully on TYPO3 3.8.1, 4.0, 4.0.2, 4.1.1 and 4.1.4, under PHP 4.4.x and later and PHP 5.2.x and later.

Stable, unstable or beta?

Since v. 1.0 News Feeder has been declared *stable* because it reads the news and extracts the contents correctly (except in v. 1.1.20-22, because of changes in the HTML code of Google.it/.com). The extension works correctly (see To-Do list and Known problems) and will be declared beta only if there are major problems causing great instability. Some problems may depend on new site definitions not being loaded: at each update, do not forget to reload the site definitions.

User manual

News approval

When ttnews_feeder is launched interactively or via CRON, it stores the news in the DB for the sites marked 'production'. News fetched from accredited sites is published immediately; for this to work, configure your TSconfig properly (parameter: clearCachePages). If the page cache is not cleared, by this parameter or by other means, your news will not be available in the FE.

News approval is very easy. Select the item News approval from the top-right menu and wait.

For each news item you will see some data and the URL. To check the original page, click the URL and the page opens in a new window. Then select one of the radio buttons:

suspend: keep the news suspended, no effect on status

delete: delete the news (hidden)

approve: the news is approved and published

Once you have decided what to do, press the Confirm button.

News Statistics

Here you can see the stats for news published, deleted, to approve, etc.

Manual Check

Click on Web > News Feeder and click on your FEEDER FOLDER. Before running a Manual Check, I suggest you define one or more sites correctly and then test them through the menu 'Site/Search Engine Test'.

Manual Check loads the retrieved news into your database; news fetched from accredited sites will be immediately available online if you set the cache parameters correctly (see the parameter clearCachePages below).

Site/Search Engine Test

This option is for admins only. Click on Web > News Feeder and click on your FEEDER FOLDER. You can define one or more search engines/sites to visit and easily fetch the required news. Once you have defined a site and marked it as 'test-site', you can check whether it works correctly and test the criteria loaded for exclusion or automatic approval. Note: this option visits all sites and repeats the visit for each associated keyword.

Delete news

This option allows you to manually delete all news that has expired according to the preferences selected for each search engine/site defined. The news is not really deleted: it stays in your database as a record marked 'deleted'. This is very useful because News Feeder checks whether a title is already loaded, and all criteria keep working until the record is removed definitively.

Clean Database

Acts only on the records deleted with the previous option: they are removed definitively from your database after the number of days set in your preferences (see the option removeExternalOldNews for external news). It also helps you keep your database clean of your own internal news; see the option removeMyOldNews.

Images note: this option definitively removes all images related to your news.

Show Configuration

This is a simple report for each search engine, showing if the engine is under test, hidden and other parameters for delete and clean options.

Load site definitions

This option is for admins only. It allows the admin user to load any of the predefined sites that are listed and checked.

New site: if the site was not loaded before, it is added to your database. Each new site is configured to run in test mode; to run it in production mode, edit the record properties and change the status.

Update: if the site was created earlier using News Feeder, the site (if checked) is updated automatically. The update modifies only the fields containing the occurrences used to extract records from the page and the site URL to connect to. You must update your site definitions again when something goes wrong (e.g. you can no longer read news from Google.com).

Important: if you need to update your sites, remember that this option does not use the internet to download new definitions. You must reinstall the extension. The best way is to download it directly from typo3.org/extensions/ and avoid older versions (mirrors are often not up to date).

Warning: the update overrides all field values and is based on the creation date of the records listed. The only way to reuse data from a predefined site is to copy it using ONLY the BE interface; the creation date then changes and you have a new site that will no longer be updated. For example, if you are Dutch you may want to copy the 'google.com news' site, keep the original and modify the copy (e.g. google.nl news). Follow these steps:

  • first load the 'google.com news' site definitions
  • through the BE interface, copy and paste the site
  • rename the copied site to google.nl according to your needs (adjust the name, URL, etc., after connecting to google.nl and doing some tests)
  • edit the copied site, apply your modifications and run a test
  • from then on News Feeder will not touch the google.nl site definition; it will update only the google.com definitions
  • if you want to collaborate, please send me a copy of your definition (you can save it from the BE with a right-click in your window, then attach it to the email)

Latest site update

==========================  =============  =================
Site name                   Site type      Review date
==========================  =============  =================
news.google.com             Search engine  NOT SUPPORTED (1)
news.google.it              Search engine  NOT SUPPORTED (1)
yahoo.com news (english)    Search engine  Dec 2011
yahoo.it news (italian)     Search engine  Dec 2011
yahoo.it news (german)      Search engine  Dec 2011
it.bing.com/ (italian)      Search engine  Dec 2011 (2)
www.bing.com (deutsch)      Search engine  Sept 2012 (2)
==========================  =============  =================

(1) Since Dec 1, 2011 the news published by Google is rendered in the browser using JavaScript, so it is not possible to fetch it. Within 1-2 months a new extension will be available to read the news using POP3 and store the fresh news in the DB.

(2) Bing detects the location of your server and returns news in the language of that location. 'Deutsch' means that the news will be in German for connections to Bing from Germany.

HINT: re-edit your keywords and relate them to the new search engines to ensure fresh news for your site (you can use Yahoo, Bing, etc.).

About yahoo.it/.com: this engine shows images that cannot be fetched by News Feeder because the published images are not related to any news item.

FAQ

None

Administration

Installation notes

This extension is reserved for administrators. However, if you limit access to your folders (this will be explained in detail in the future), you will be able to delegate news approval, deletion and other tasks to one or more BE users.

This manual is under development, so to run the extension correctly I suggest you follow the configuration instructions step by step; see the next chapter.

Legal issue: somewhere on your site, please cite the sites/engines visited.

It is recommended to read these steps carefully; otherwise it will be very difficult to run the extension correctly.

Install the extension as an admin BE user.

Once it is installed, clear the /typo3conf cache.

Confirm the requested DB modifications. The extension requires the tt_news extension to be installed and adds a new field to the tt_news table: this is needed to record which site each news item was fetched from.

Create a sysfolder (e.g. name it NEWS_FEEDER) to store your configuration parameters and take note of its PID number:

a. within your site you can create one or more folders; one folder is suggested.

b. edit the page properties of your FOLDER, or of another page in its rootline (above your page), and insert the following configuration lines in the TSconfig field; you can simply copy and paste them:

Configuration example

(copy and paste, then change references...):

mod.web_txttnewsfeederM1 {
  clearCachePages = 1,364,365,366,367,369,370,378,383
  useRandomTime = 1
  fetchImages = 1
  resizeImages = 1
  resizedJpgCompression = 60
  resizedImagePxWidth = 80
  maxImageByteSize = 20000
  maxImagePxWidth = 240
  maxImagePxHeight = 240
  useSubIfTitleIsEmpty = 1
  useTitleIfSubIsEmpty = 1
  backDays = 7
  suspendFlag = 0
  autosuspendLimit = 100
  maxRecordsPerSession = 30
  feederSysFolderPID = 353
  newsSysFolderPID = 360
  removeExternalOldNews = 20
  removeMyOldNews = 360
  debugFeed = 0
  charSet = cp1252
  cronWriteOnlyAccredited = 1
}

Images

TEST/MANUAL CHECK - If you need to download images, you must define the following parameters (as above):

fetchImages = 1
resizeImages = 1

CRON MODE - If you need to download images, remember that the pictures are written to uploads/pics. News Feeder automatically assigns them to the owner/group of the folder uploads/pics. However, you can force another owner/group by adding these two parameters to the configuration above:

apacheOwner = www-data
apacheGroup = www-data

Finally you can set the compression quality for JPG files and the size limits (see Reference).

Clear the cache

In CRON mode, when the feeder has finished, you need to clear the cache for some pages, using this parameter:

mod.web_txttnewsfeederM1.clearCachePages = all

Depending on your needs you can also use 'pages' or 'temp_CACHED' (see the TYPO3 API reference).

FAQ

- none

Configuration

Before running this extension, it is recommended to read these steps carefully and configure it (see *Administration*); otherwise it will be very difficult to run the extension correctly.

How to define a new engine/site

Define your first engine

Within the sysfolder assigned to the FEEDER, create your first site. The following example concerns the configuration parameters for the engine:

http://www.google.it

As stated before, this manual is reserved for administrators (see User manual). The best way to get this extension working is to follow the instructions step by step. New documentation will be published in the future to explain how to configure a new site, study its HTML, etc.

If you are an admin you can load a predefined engine or define a new one. To start as quickly as possible, run News Feeder and select the last option from the drop-down menu: 'Load sites definition'. This option creates a new engine for you; the definitions are stored in a file shipped with this extension. News Feeder will check and create a new engine for you:

Google (test mode) news.google.it

This engine setup works fine and was tested for a long time. The tag definitions inside are for the Google news engine in ITALIAN (http://www.google.it). google.com changed the HTML output of its news pages, but the google.com news definition was tested on Jan 04, 2007 and has worked since then. Contact me only if the preloaded site definitions do not work correctly.

Now open your FEEDER folder (from the BE interface: List -> select your folder) and you will see what happened. Edit the 'Google (test mode) news.google.it' record and you will see the page with the parameters needed to fetch the news.

Warning: this extension works using GET vars, the PHP file() function to fetch the pages, and the PHP eregi() function to accept or exclude sites/titles. If you are not familiar with them, please refer to http://www.php.net. The extension does not emulate a browser (it could in the future) and is therefore unable to send POST data.

Brief explanation of the fields used:

Hide: if the engine is hidden it will not be processed by ttnews_feeder

Search engine name: the site/engine name

Scheme: default http://, alternative https://. Trick: for testing, save the remote page (using Mozilla, Explorer, etc.) to your hard disk and transfer it to your server. This avoids stressing the remote server during tests.

Url: the URL for the connection. Here you can use some markers:

  • ###RECORDSTOVIEW###: how many records to retrieve (e.g. 10, 20, 50, 100); the content is defined in the keywords table
  • ###SEARCHKW###: replaced by the search keywords; the content is defined in the keywords table
  • ###EXCLUDEKW###: replaced by the keywords to exclude; the content is defined in the keywords table
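The marker substitution can be pictured with a minimal Python sketch. It is not the extension's PHP code: the function name build_url, the example template and its num parameter are assumptions made for illustration only; the three marker names come from this manual.

```python
def build_url(scheme, url_template, search_kw, exclude_kw, records_to_view):
    """Fill the three URL markers and prepend the scheme (illustrative sketch)."""
    replacements = {
        "###RECORDSTOVIEW###": str(records_to_view),
        "###SEARCHKW###": search_kw,
        "###EXCLUDEKW###": exclude_kw,
    }
    for marker, value in replacements.items():
        url_template = url_template.replace(marker, value)
    return scheme + url_template


# Hypothetical template: the "num" parameter is an assumption, not a documented one.
TEMPLATE = ("news.google.it/news?hl=it&ned=it"
            "&q=###SEARCHKW######EXCLUDEKW###&num=###RECORDSTOVIEW###")

example_url = build_url("http://", TEMPLATE, "bush", "+-powell", 20)
```

With these inputs the search and exclusion keywords end up side by side in the q parameter, which is why the exclusion field carries its own leading '+'.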

Charset: you can select one of the listed items. All strings (title, subtitle, font) will be converted to this charset. If you don't know what to choose, try cp1252. If you see undesired characters, change this parameter until the problem disappears.

Content unwrap: a tag (or piece of a tag) and another tag (or piece of a tag) that tell the extension what to fetch from the page. Content means the whole block of the page containing all the news.

Section unwrap: a tag (or piece of a tag) and another tag (or piece of a tag) that tell the extension what to fetch from the Content (above) to extract each news item (title, subtitle, font, etc.).

Title unwrap: a tag (or piece of a tag) and another tag (or piece of a tag) that tell the extension what to fetch from the Section (above) to extract the title.

Subtitle, Font and Link unwrap: as above.

Subtitle extraction method: if the title of the news and its subtitle are located on the same page, select 'from search page (url above)': the URL field is used to fetch the subtitle, i.e. it comes from the same page. Otherwise you must select 'from target page, news link'. This second option can slow down extraction because News Feeder loads another page to examine and fetch the subtitle; which page depends on the extracted link (see Link unwrap below). If the text is long it is truncated to the first 255 characters, preserving the last word found (this is not a plain crop).
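One plausible reading of the 255-character rule is a cut that never splits a word. The Python sketch below illustrates that reading; the function name truncate_on_word and the exact boundary handling are assumptions, not the extension's actual code.

```python
def truncate_on_word(text, limit=255):
    """Cut text to at most `limit` characters without splitting the last word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the character right after the cut is whitespace, the cut fell on a word end.
    if text[limit].isspace():
        return cut.rstrip()
    # Otherwise drop the partial word at the end of the cut.
    head, sep, _ = cut.rpartition(" ")
    return head.rstrip() if sep else cut


short = truncate_on_word("alpha beta gamma", 12)  # the partial "g" is dropped
```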

Image unwrap, if any found in the section: if the section extracted (captured with Section unwrap) contains an image and you set the parameter fetchImages = 1 (bool), News Feeder downloads the images recognized according to the TYPO3 configuration parameters defined during the installation process. The images are stored in the /uploads/pics/ folder of your site. Images larger than the maxImageByteSize parameter are not written and are thus ignored.

All tags for extraction are separated by the marker ###SEP###; use this marker and the URL markers to design a new engine/site. If you need to define a new site, study the page carefully and define these unwraps correctly, then use TEST MODE to check that the site works and finally switch the site to production mode (MANUAL CHECK or CRON MODE).
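To make the nesting of Content, Section and Title unwraps concrete, here is a minimal Python sketch. It assumes that an unwrap value has the form "start###SEP###end" and that the extension keeps the text between the two parts; the function names, the sample HTML and this plain substring search are all assumptions, and the real extension may match differently.

```python
SEP = "###SEP###"


def unwrap_all(text, spec):
    """Return every fragment of `text` lying between the two halves of `spec`."""
    start, end = spec.split(SEP, 1)
    pieces = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        stop = text.find(end, begin)
        if stop == -1:
            break
        pieces.append(text[begin:stop])
        pos = stop + len(end)
    return pieces


def unwrap_first(text, spec):
    """Return the first fragment matched by `spec`, or an empty string."""
    found = unwrap_all(text, spec)
    return found[0] if found else ""


# Hypothetical page and unwrap values, for illustration only.
page = ('<html><div id="news">'
        '<p class="n"><b>First title</b> sub one</p>'
        '<p class="n"><b>Second title</b> sub two</p>'
        '</div></html>')
content = unwrap_first(page, '<div id="news">###SEP###</div>')
sections = unwrap_all(content, '<p class="n">###SEP###</p>')
titles = [unwrap_first(s, "<b>###SEP###</b>") for s in sections]
```

Each step narrows the text the next unwrap has to search, which is why a wrong Content or Section unwrap usually makes every later field come back empty.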

Link unwrap: used to fetch the link that points to the site where the entire news item is published (see also Subtitle extraction method).

Url to add to the extracted link: a site (especially a static one) may point to its internal news using only relative references (e.g. /index.php?id=28). If such a site is indexed by ttnews_feeder, the relative path cannot be published on your TYPO3 site, so the extension adds this URL to reconstruct the entire (absolute) path. Note: if you are configuring a static/dynamic site and the image unwrap is set, this URL is also used to fetch the images. When News Feeder analyzes a link it checks whether it starts with 'http://' or 'https://' (absolute paths); if not, it prepends this parameter to what was fetched.
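The prefix rule described here is small enough to sketch in Python. The function name absolute_link is hypothetical, and the plain concatenation (no slash normalisation, case-sensitive scheme check) is an assumption about details the manual does not state.

```python
def absolute_link(extracted, url_to_add):
    """Keep absolute links as they are; prepend `url_to_add` to everything else."""
    if extracted.startswith(("http://", "https://")):
        return extracted
    return url_to_add + extracted


link = absolute_link("/index.php?id=28", "http://www.example.com")
```

Because the two parts are simply joined, the value of 'Url to add to the extracted link' should end exactly where the relative links of the site begin.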

Autoclean (interactive or CRON mode): if enabled, expired records are deleted (not removed!); the expiry is defined in the next box:

Autoclean backdays: all news related to this site is considered for deletion after the number of days defined here. Deleted news is still present in the database and used for title/URL exclusion, but is not available to visitors.

Mode: the running mode. The first time, please select Test mode.

Check every n days: check frequency in Cron/Manual check mode: '0' means every day, otherwise enter the number of days between one check and the next. Note: if you leave this field empty News Feeder will use 0.

Notes: internal notes. When you run an UPDATE this field is preserved and News Feeder appends the date and time of the UPDATE.

Titles excluded, accredited and refused sites

These tables are used to exclude or accredit sites, and their use is intuitive. An excluded-title record needs to specify the URL related to the title; you can use REGEXP. As stated before, please refer to the PHP site for the REGEXP syntax.
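How these tables might combine into a news status can be sketched in Python. This is a simplified model, not the extension's logic: the order of the checks, the status names, the Rules structure and the use of Python's re module with IGNORECASE (standing in for the case-insensitive PHP eregi) are assumptions, and the sketch ignores the URL attached to each excluded title.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rules:
    excluded_titles: list = field(default_factory=list)   # regular expressions
    refused_sites: list = field(default_factory=list)     # regular expressions
    accredited_sites: list = field(default_factory=list)  # regular expressions
    known_titles: set = field(default_factory=set)        # lower-cased titles already in the DB


def _matches(patterns, text):
    return any(re.search(p, text, re.IGNORECASE) for p in patterns)


def classify(title, url, rules):
    """Return 'refused', 'published' or 'waiting' for one fetched item."""
    if title.lower() in rules.known_titles:
        return "refused"      # duplicate of a stored (possibly deleted) record
    if _matches(rules.excluded_titles, title) or _matches(rules.refused_sites, url):
        return "refused"
    if _matches(rules.accredited_sites, url):
        return "published"    # accredited sites skip manual approval
    return "waiting"


rules = Rules(
    excluded_titles=["casino"],
    refused_sites=[r"spam\.example"],
    accredited_sites=[r"trusted\.example"],
    known_titles={"old story"},
)
```

The duplicate check only works while deleted records are still in the database, which is the reason the manual distinguishes between deleting and removing.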

How to define and use keywords

Define your keywords - Within the system folder assigned to the FEEDER, create your keywords. The following list covers the configuration parameters for a keyword. You can define several keywords and configure them individually to obtain different results. Each keyword can be related to one or more sites:

Hide: if the keyword is hidden it will not be processed by ttnews_feeder

keyword: the search keyword. You must use the syntax of the desired search engine; e.g. for Google you can fill this field with: antivirus+security (use '+' as separator)

but not...: keyword (or list of keywords) to exclude; for Google, typically: +-microsoft+-HIV+-flu

search engines: select from the right-hand box the search engines you want to explore using the keyword. Note that Google, Yahoo and Excite use the same keyword syntax. For sites that use a different syntax for keyword definition and exclusion you must create a new keyword.

Category: here you can select one or more categories to relate to the news extracted and approved. This is very useful if you need to aggregate news on your site using the tt_news plugin. Refer to the tt_news documentation to learn how to create categories.

Notes: internal notes. Put here whatever you want to remember.

I suggest you define one or more search engines first and then define the keywords. You can relate each keyword to one or more search engines, but each keyword must respect the syntax of the search engine(s) selected: Google, AltaVista and Excite use the same syntax. If the syntax is different, you must define another keyword for that search engine.

How to define a keyword correctly - To avoid errors, please follow the steps below:

  • using your preferred browser, connect to the desired engine (e.g. http://news.google.it)
  • fill in the search box with, for example, the keywords: bush -powell (meaning: search for bush news but avoid content about 'powell')
  • click on the search button
  • note that the URL box has changed; for the example above you will see: http://news.google.it/news?hl=it&ned=it&q=bush+-powell&btnG=Cerca+nelle+notizie
  • now you can see how Google passes the GET vars
  • fill the field keyword (see the previous paragraph *Define your keywords*) with: bush
  • fill the field but not... (see the previous paragraph *Define your keywords*) with: +-powell
  • finally, associate your keyword with the search engine and run a test
  • when all is OK, change your search engine properties, switching to production mode
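The steps above can be checked with a short Python sketch that reads the search URL copied from the browser and derives the two field values. The helper name keyword_fields is hypothetical, and the rule "terms starting with '-' are exclusions" is an assumption that holds for the Google-style syntax used in this example.

```python
from urllib.parse import parse_qs, urlsplit


def keyword_fields(search_url, param="q"):
    """Split the query of a search URL into the 'keyword' and 'but not...' values."""
    query = parse_qs(urlsplit(search_url).query)
    terms = query.get(param, [""])[0].split()
    wanted = [t for t in terms if not t.startswith("-")]
    unwanted = [t for t in terms if t.startswith("-")]
    keyword = "+".join(wanted)
    but_not = "".join("+" + t for t in unwanted)
    return keyword, but_not


url = ("http://news.google.it/news?hl=it&ned=it"
       "&q=bush+-powell&btnG=Cerca+nelle+notizie")
keyword, but_not = keyword_fields(url)
```

For the URL of the example this yields the same two values the steps tell you to type in by hand.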

Test mode

Once you have configured the extension and defined a keyword and a search engine, you can run a test.

Test mode doesn't write any record to your DB and is a great way to check whether your engine configuration is working well. To run test mode click on:

img-5

and then in the right-frame select the menu item:

Test news engine/sites

Read the text, select the name of the site to test (or All) and click on the button:

Run site/engine test

Note: if you see nothing, you have probably not defined a site/engine and a related keyword yet. Test mode is very similar to production mode, but the modifications are not written: the DB is only checked for records already stored. The records displayed have an icon on the left; on the right there is a brief explanation (called the 'news status'). Images are not written to your server, only displayed through a link to the remote site.

Production mode

Once you have configured the extension and tested the site/engine as explained, you can switch the site/engine status to production mode (refer to the engine configuration to do so).

When a site is in production mode, records are written to the DB. To run it, follow these instructions:

Click on

img-5

and then in the right-frame select the menu item:

Run Manual Check

read the text and click on the button:

Run Manual Check

Please wait some seconds for it to finish, then read what was fetched.

Note: if you have deleted a record (manually, or it was automatically refused), the record is only hidden and is still stored in the DB. It is removed definitively only through the menu item Clean DB. Keep in mind that if you remove records definitively, using ttnews_feeder or other utilities, the extension can no longer check whether a given news item is already stored, so a new manual (or CRON) check will load it again as fresh news. Images, if any, are written to your server in the folder uploads/pics according to the given parameters; images larger than maxImageByteSize are skipped.

CRON mode

Since v. 3.0.1 you must remove all existing CRONTAB entries and change the setup as follows.

First add a new BE user to your site with the name:

_cli_ttnewsfeeder

Set the parameter newsBEOwner (see reference):

mod.web_txttnewsfeederM1.newsBEOwner = <uid>

If you want to edit/display the fetched news, set the uid above to '1' (usually the uid of the admin user); otherwise use the uid of another BE user or, if you prefer, the uid of the user:

_cli_ttnewsfeeder

The choice is yours, depending on security considerations and the privileges assigned to the various BE users.

Since ttnews_feeder v. 3.0.1 I suggest you install and configure the system extension SCHEDULER, then configure a new task to fetch the news using ttnews_feeder at the desired interval.

Finally set up the crontab by adding a line like this:

5  *  *  *  *  php -q /var/www/www.example.com/web/typo3/cli_dispatch.phpsh scheduler

Please adjust the path /var/www... to your site and refer to the dispatcher configuration, which is part of the core.

Under some circumstances you will need to change the access rights of ttnews_feeder_cli.phpsh:

chmod 0755 <path-to-your-site>typo3conf/ext/ttnews_feeder/Classes/Cli/ttnews_feeder_cli.phpsh

Warning! The News Feeder behaviour will be the same as in the BE, so I suggest you try it in the BE first. Using CRON, News Feeder fetches news and, for accredited sites, publishes it immediately. This is a good way to automate your site but carries some risk, so select the sites you define as 'accredited' carefully. Other news, coming from non-accredited sites, is stored in your database and you must approve it manually. Don't forget that you must define at least one keyword and an engine, and select the mode CRON MODE or CRON MODE+MANUAL MODE.

Suspend CRON mode: you can suspend CRON (e.g. when you are on vacation) by setting

suspendFlag = 1

You can also set the parameter autosuspendLimit = <value>

to a proper value: when CRON detects that the number of unapproved news items exceeds the limit, it does not fetch or store news.

How to receive a report via email: if you are admin, set up CRON as above and append the following to the end of the line:

(...) ttnews_feeder_cli.phpsh | admin@your-domain.com

If you are admin and several people are responsible for news approval, each for a different section, you will receive the same report you see in interactive mode (BE) for all activated sections. If you want each person responsible for a section to receive an email with a report, configure the parameter newsResponsibleEmail in the module TSconfig (see Reference).

At each CRON run, each responsible person receives an email with their own report.

Store only the accredited site records

If you set cronWriteOnlyAccredited to '1' and the CRON TASK is active, News Feeder keeps as live records only those coming from accredited sites. This is very useful if you want to automate the approval process completely and avoid manual approval. Valid records that would normally wait for manual approval are stored in the DB marked as deleted, so that News Feeder can recognize them and reject them again on the next check.

Cron keeps your DB clean! If you set suspendFlag to 1 and the CRON TASK is active, News Feeder is still launched and keeps your DB clean, checking for records to delete and remove.
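The interplay of suspendFlag, autosuspendLimit and cronWriteOnlyAccredited can be summarised in a small Python sketch. It is a reading of the text above, not the extension's code: the function names and status labels are hypothetical, and it assumes that cleaning runs on every CRON launch, including launches where fetching is skipped.

```python
def cron_actions(suspend_flag, pending_approval, autosuspend_limit):
    """Return the steps a CRON run performs, in order."""
    actions = []
    # Fetching is skipped while suspended or when too many items await approval.
    if not suspend_flag and pending_approval <= autosuspend_limit:
        actions.append("fetch")
    actions.append("clean")  # assumption: cleaning always runs
    return actions


def cron_store_status(accredited, write_only_accredited):
    """Return how a valid fetched record is stored during a CRON run."""
    if accredited:
        return "published"
    # Non-accredited records are kept as 'deleted' so they are rejected next time.
    return "deleted" if write_only_accredited else "waiting"


on_vacation = cron_actions(suspend_flag=True, pending_approval=0, autosuspend_limit=100)
```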

Notes about the images

Image download is available only if you set the fetchImages parameter to true (1). If you also want downloaded images to be resized to a certain value (e.g. 100 px), you must set the autoresizeImages parameter too.

If you set autoresizeImages to true (1), images are resized first, and only then measured and accepted according to the maxImageByteSize, maxImagePxWidth and maxImagePxHeight parameters. Allowed extensions: News Feeder first accepts only the image extensions allowed by the TYPO3 general configuration. Note that the autoresize option works only for the JPEG, JPG, GIF and PNG image formats. If autoresize is on and an image is in none of these formats, it is accepted and measured as described above and, if it is oversized, refused.
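The "resize first, then measure" order can be illustrated with a Python sketch that works on image measurements only. The function names are hypothetical, the proportional height calculation is an assumption, and the default limits are simply the values from the configuration example (20000 bytes, 240 x 240 px), not documented defaults.

```python
def scaled_size(width, height, target_width):
    """Scale to `target_width`, keeping the aspect ratio (height is derived)."""
    new_height = max(1, round(height * target_width / width))
    return target_width, new_height


def within_limits(width, height, size_bytes,
                  max_bytes=20000, max_width=240, max_height=240):
    """Check measurements against maxImageByteSize, maxImagePxWidth, maxImagePxHeight."""
    return size_bytes <= max_bytes and width <= max_width and height <= max_height


# A 480x360 picture is too wide as fetched, but fits once scaled to 80 px width.
original_ok = within_limits(480, 360, 15000)
resized = scaled_size(480, 360, 80)
resized_ok = within_limits(*resized, size_bytes=4000)  # byte size after resizing is illustrative
```

This is why resizing lets more images through: the limits are applied to the smaller, resized file rather than to the original.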

Autoresize images: I suggest keeping it on, because you save disk space on your server and get more images for your news, since images are rarely refused.

Image quality: the first release with image support (v. 1.1.16) was not tested with the PNG format and could be improved. Please contact me if images are not displayed as expected, so I can introduce new code for resizing.

Images and tt_news: if you tell News Feeder to resize images, keep in mind that the tt_news extension resizes all images again to create thumbnails for news listings and other views. The best way to avoid low quality is to set the tt_news parameters (max image width and max image height) greater than or equal to resizedImagePxWidth. The height is calculated automatically by News Feeder.

FAQ

Why can't I see anything under test mode? Check if you configuration is ok (header unwrap etc.) then verify if your site. Acommon error for the engines is that they need to be related from a keyword definition. If you have not loaded a keyword related to your (new) site, your site will be not visited.

I've had just loaded a new definition, run a manual test and I can't see nothing. Why? You can define several sites/engine but to run them you must create at least one keyword and associate it (relate) to your engine. So, if you have just loaded a new engine (i.e. Google ) please load a new keyword and from the menu select the engine.

When parsing 'news.google.it', a subtitle sometimes disappears. Why? The extension extracts the text using the 'unwrap' parameters passed through the search engine definition. Some Google records are formatted differently and the extension cannot extract them correctly. However, the title is always available.

I'm Italian and I have loaded the news.google.COM site definition. Nothing works, why? The extension connects to news.google.com, but Google redirects to the Italian service, news.google.it. The pages are formatted differently and the extension cannot fetch records if the site is redirected.

Reference

The most important settings for a working installation:

  • Define the pid of the ttnews_feeder system folder (feederSysFolderPID)
  • Define the uid of the user who owns the news (newsBEOwner)

- Reference (TSconfig): ttnews_feeder – News Feeder

clearCachePages

Property

clearCachePages

Data type

int+/string

Description

List of the page ids (pid) to clear from the cache. This runs at the end of the process so that fresh news from accredited sites is immediately available on BE (since v.2.1.1 you can also use: pages, all, temp_CACHED).

Default

useSubIfTitleIsEmpty

Property

useSubIfTitleIsEmpty

Data type

boolean

Description

1 (true), 0 (false) – If set to 1 and the news Title field could not be extracted (for whatever reason), it is replaced by the subtitle, limited to 60 characters.

Default

1

useTitleIfSubIsEmpty

Property

useTitleIfSubIsEmpty

Data type

boolean

Description

1 (true), 0 (false) – If set to 1 and the news Subtitle field could not be extracted (for whatever reason), it is replaced by the title, limited to 250 characters.

Default

1

BackDays

Property

BackDays

Data type

int+

Description

Under evaluation; currently not used

Default

7

suspendFlag

Property

suspendFlag

Data type

boolean

Description

Set to '1' if you are on vacation: this will suspend any fetching through CRON

Default

0

autosuspendLimit

Property

autosuspendLimit

Data type

int+

Description

Works only in CRON mode. If this limit is reached (e.g. no operator is available to approve fresh news because of a vacation), no more news is accepted and stored in the DB. The counter keeps track only of approved news. This prevents DB overload.

Default

100

maxRecordsPerSession

Property

maxRecordsPerSession

Data type

int+

Description

Works only in MANUAL CHECK mode. If this limit is reached, no more news is accepted and stored in the DB. The counter keeps track only of approved news.

Default

30

feederSysFolderPID

Property

feederSysFolderPID

Data type

int+

Description

The PID of the page where your configuration tables (keywords, sites/engines to visit, etc.) are stored.

Default

required

newsSysFolderPID

Property

newsSysFolderPID

Data type

int+

Description

The PID of the page where your EXTERNAL NEWS is stored. I suggest keeping your internal and external news separate so that they are easier to inspect.

Default

ul

newsBEOwner

Property

newsBEOwner

Data type

int+

Description

Use this parameter only if you want the same user id to be written to the tt_news table for every record; otherwise the UID of the BE user running News Feeder is used.

Default

1

removeExternalOldNews

Property

removeExternalOldNews

Data type

int+

Description

Days back - When this limit is reached: CRON (if used) will remove expired news; if you work in MANUAL CHECK, the news will be removed manually

Default

50

removeMyOldNews

Property

removeMyOldNews

Data type

string

Description

Days back - When this limit is reached: CRON (if used) will remove expired news; if you work in MANUAL CHECK, the news will be removed manually.

Default

920

charSet

Property

charSet

Data type

String

Description

Charset for HTML conversion; accepts the same values as the PHP htmlentities function.

Default

cp1252

maxImageByteSize

Property

maxImageByteSize

Data type

int+

Description

Maximum size, in bytes, of the fetched images.

Default

15000

fetchImages

Property

fetchImages

Data type

bool

Description

Whether to fetch images from the site/engine. Disabled by default.

Default

0

maxImagePxWidth

Property

maxImagePxWidth

Data type

int+

Description

If the width of the fetched image (in pixels) exceeds this limit, the image is refused.

Default

300

maxImagePxHeight

Property

maxImagePxHeight

Data type

int+

Description

If the height of the fetched image (in pixels) exceeds this limit, the image is refused.

Default

300

resizeImages

Property

resizeImages

Data type

bool

Description

Autoresize for the downloaded images; if set, all images are resized according to the resizedImagePxWidth parameter.

Default

0

resizedImagePxWidth

Property

resizedImagePxWidth

Data type

int+

Description

This works only if fetchImages and resizeImages are both set to 1 (true). Images narrower or wider than this value are resized to it: with the default of 80, a downloaded image 120 pixels wide becomes 80 pixels wide, and one 60 pixels wide also becomes 80 pixels wide.

Default

80

resizedJpgCompression

Property

resizedJpgCompression

Data type

int+

Description

Compression for output image if extension is JPG or JPEG; use 100 for no compression.

Default

70

useRandomTime

Property

useRandomTime

Data type

bool

Description

Whether the date and hour assigned to the fetched news are calculated randomly. You can disable this by setting it to '0'; this can be useful to keep the news in the order of importance given by the search engine visited.

Default

1

newsResponsibleEmail

Property

newsResponsibleEmail

Data type

String

Description

Type a valid email address. Each time CRON runs, an email containing a report is sent to this address.

Default

cronWriteOnlyAccredited

Property

cronWriteOnlyAccredited

Data type

Bool

Description

If set to '1' and News Feeder is running under CRON, only the records of accredited sites are written to the DB.

Default

apacheOwner

Property

apacheOwner

Data type

String

Description

CRON mode: downloaded images will be assigned this owner. Default: owner of uploads/pics.

Default

Same as uploads/pics

apacheGroup

Property

apacheGroup

Data type

String

Description

CRON mode: downloaded images will be assigned this group. Default: group of uploads/pics.

Default

Same as uploads/pics

[tsref:(cObject).web_txttnewsfeederM1]
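
The title/subtitle fallback described for useSubIfTitleIsEmpty and useTitleIfSubIsEmpty can be sketched as follows. This is an illustrative Python sketch, not the extension's PHP code: the function name is hypothetical, and the rule that a record with both fields empty is refused is taken from the changelog entry of 26.11.2006.

```python
from typing import Optional, Tuple

TITLE_LIMIT = 60      # characters, as documented for useSubIfTitleIsEmpty
SUBTITLE_LIMIT = 250  # characters, as documented for useTitleIfSubIsEmpty


def fill_title_and_subtitle(
    title: str,
    subtitle: str,
    use_sub_if_title_is_empty: bool = True,
    use_title_if_sub_is_empty: bool = True,
) -> Optional[Tuple[str, str]]:
    """Return (title, subtitle) after applying the fallbacks,
    or None when the record would be refused."""
    title, subtitle = title.strip(), subtitle.strip()
    if not title and not subtitle:
        # Both fields empty: the record is refused.
        return None
    if not title and use_sub_if_title_is_empty:
        # Missing title: reuse the subtitle, cut to 60 characters.
        title = subtitle[:TITLE_LIMIT]
    elif not subtitle and use_title_if_sub_is_empty:
        # Missing subtitle: reuse the title, cut to 250 characters.
        subtitle = title[:SUBTITLE_LIMIT]
    return title, subtitle
```

If a fallback is switched off, the empty field simply stays empty.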

To Do

  • a new +ext to read Google news via POP3 (by February 2012)
  • improve settings (site defs) and add some new engines
  • integrate with scheduler.

Known problems

  • Since Dec 01, 2011 ttnews_feeder cannot fetch google.it/.com/.de news because Google publishes the news in your browser *exclusively* using JavaScript. The news is therefore not present in the page source in readable form. A new +ext to read Google records via POP3 is planned within 1-2 months.

  • When the feeder runs via CRON and you have created two or more (different) BE FOLDERS, the fetched news is stored improperly. Please avoid using more than one folder; this will be fixed soon.

  • When the feeder runs from the BE through the SCHEDULER (manually), you should see the fetched records on screen. SCHEDULER mode still needs adjusting and is not yet reliable. Moreover, when I tried to add the code for the SCHEDULER, the scheduler refused to be configured and I got this error:

    PHP Fatal error: Class 'tx_ttnews_feeder_schedule' not found in /var/www/typo3_src-4.4.5/t3lib/class.t3lib_div.php on line 5260

    This is under evaluation; use the manual configuration to run the feeder from CRON instead.

  • If you run the extension behind a WEB ACCELERATOR, please disable it, because the image sizes will not be calculated correctly. PHP does not go through the WEB ACCELERATOR, so the images it fetches are the same as on the remote site.

  • Check your memory limit for PHP – News Feeder was tested on a server with the limit set to 72 MB and image fetching enabled, and the extension ran very slowly. A value of 96 MB should be enough to work correctly. If you have no access to the server configuration (e.g. a hosting plan limited to 64 MB or less), consider disabling the download of images to reduce time and resource consumption. Please inform me if you face problems: at(at)uniud.it

To-Do list

SOME things to-do:

  • check for bugs under T3 6.2.X; I have not tried to download and reuse images
  • test more extensively for base64 decode of the image tag
  • A new menu for the BE with some infos/log about CRON mode.
  • improve output messages and log for updating process
  • keywords for static/dynamic sites....
  • for each keyword enable or disable image fetching...
  • documentation in italian language
  • static/dynamic sites: add code to fetch full news and import in DB (long text, news type= internal)
  • static/dynamic sites (not engines!) add a field to exclude undesired keywords
  • static/dynamic sites (not engines!) add a field to relate news fetched to one or more news category

Changelog

  • 05-10-2014 (v.3.0.1, beta) – Minor manual modifications (Crontab section).
  • 04-10-2014 (v.3.0.0, beta) – Now compatible with TYPO3 6.2; please do not install it on previous versions. Manual updated; CRON must be reconfigured.
  • 05-09-2012 (v.2.7.0) - some code changes to ensure 4.7.x compatibility. Not yet compatible with 6.x. Review of site definitions (Google is now removed).
  • 05-12-2011 (v.2.5.0) – new site definition upgraded: google not supported (the record will be hidden after the upgrade). News engines: yahoo.com for DE, IT, EN - BING for italian
  • 04-02-2011 (v.2.4.8) – new site definition upgraded, minor bug fixed. Now works with dispatcher from BE
  • 29-10-2009 (v. 2.3.3) - guide updated , new site definition upgraded
  • 29-08-2009 (v. 2.3.2) – documentation updated.
  • 28-08-2009 (v. 2.3.1) - modified htmlspecialchars_decode adding some code to ensure compatibility with PHP < 5.1; thanks to Andreas Weigelt for discovering this “bug”.
  • 01-06-2009 (v. 2.2.12) – added the use of htmlspecialchars_decode for the retrieved URL; unfortunately this feature restricts use to PHP v.5.1+
  • 28-02-2009 (v.2.2.11) – Site definition updated, guide updated. Google has just changed the format of HTML page and since today news.google.it and news.google.com have the same parameters.
  • 04-05-2008 (v.2.2.2 and v.2.2.3) – guide updated , new site definition
  • 26-03-2008 (v.2.2.1) – guide updated (some little mistakes)
  • 22-03-2008 (v.2.2.1) – Cron mode: permissions of downloaded pictures are set (by default) to the owner/group of upload/pics; you can override this parameter.
  • 22-03-2008 (v.2.2.0) – Cron mode now downloads images correctly.
  • 10-02-2007 (v.2.1.5) – Output suppressed (debug), site definition updated.
  • 27-12-2007 (v.2.1.1) – Bug fixes – The function to clear the cache now works and accepts more parameters (see reference). Property: mod.web_txttnewsfeederM1.clearCachePages = all. Clearing modified to handle all cache, pages and a list of ids.
  • 23-12-2007 (v.2.0.7) – Bug fixes – Library class modified (if there is only a site defined, news wasn't fetched). Italian definition for google.it doesn't work correctly because the URL was not defined correctly. Guide updated.
  • 27-07-2007 (v.2.0.4) – New TS config, CLI mode report messages added/improved
  • 28-05-2007 (v.2.0.2) – Minor bug fixes, CLI mode report messages added/improved
  • 22-05-2007 (v.2.0.1) – Two bug fixes – External news not removed (all Modes), mail not starting in CRON mode.
  • 22-05-2007 (v.2.0.0) – Major release – Now works in CRON mode; PHP code has been reviewed and heavily modified; a new class introduced; some improvements and minor bugs fixed. Guide updated for CRON and other topics.
  • 06-01-2007 (v.1.2.2) – new field for search engine/site charset (please update to this version and reload the site definitions)
  • 06-01-2007 (v.1.2.1) – adjusted Google.news definition; images: some PHP code modified to preserve colors.
  • 05-01-2007 (v1.2) – new site definitions for google.it news ; modified some code for update, new documentation.
  • 04-01-2007 (v.1.1.21) – new site definitions for google.com news ; modified some code for update.
  • 02.01.2007 (v. 1.1.16) – image autoresize feature implemented; the field "check every n days" is no longer required, because of a TYPO3 bug when testing this type of field.
  • 30.12.2006 (v. 1.1.12 to v. 1.1.15) – minor bug fixing
  • 22.12.2006 (v. 1.1.11) – add url parameter adjusted for static/dynamic sites to allow remote image fetching; manual upgraded, a message substituted; font inserted before subtitle in test/production mode.
  • 17.12.2006 (v. 1.1.10) - all (little) bugs connected to image management are removed.
  • 12.12.2006 - modified userBEowner parameter; access bug fixed: non-admin users can now load, delete and remove records.
  • 30.11.2006 - new DB field to configure how often (in days) the check for a site/engine starts.
  • 26.11.2006 - new TSConf parameter: useTitleIfSubIsEmpty, fills the subtitle with the title if the subtitle is empty; new TSConf parameter: useRandomTime, you can enable/disable this feature – if disabled, the records are displayed in fetching order, all with the same hour and minute; title/subtitle check: if both are empty the record is refused; image status introduced with refused/accepted and bytes message; mandatory field for titles and URLs to exclude; better specified that you can use REGEXP; option DELETE for images uploaded on approval; messages for images accepted/refused in test and production mode
  • 21.11.2006 - delete expired news: corrected the code to show how many records to clean
  • 20.11.2006 - image downloading support
  • 19.11.2006 - stable version with site definition update feature implemented
  • 14.11.2006 - problem discovered: you must copy/paste the code not into your feeder folder but into your root-page properties!
  • 11.11.2006 - new parameter for charset conversion; new function: load site definitions; error messages improved; checkboxes to select one or more sites; manual on-line updated
  • 30.10.2006 - Second version: manual upgrade, a new field introduced for the scheme. Minor changes, cache not cleared fixed.
  • 25.10.2006 - First version published
