.. You may want to use the usual include line. Uncomment and adjust the path. 
.. include:: ../Includes.txt


================
EXT: news feeder
================

:Author:
      Kasper Skårhøj

:Created:
      2002-11-01T00:32:00

:Changed:
      2014-11-05T10:14:25.790000000

:Author:
      Alex Tuveri, University of Udine

:Email:
      at@uniud.it

:Info 3:
      http://www.luxaeterna.it

:Info 4:


.. _EXT-news-feeder:

EXT: news feeder
================

Extension Key:  **ttnews\_feeder**

Copyright 2000-2014, Alex Tuveri, University of Udine, <at@uniud.it>
**current version: 3.0.1 BETA**

This document is published under the Open Content License, available
from http://www.opencontent.org/opl.shtml

The content of this document is related to TYPO3, a GNU/GPL
CMS/Framework available from www.typo3.com

.. _Table-of-Contents:

Table of Contents
-----------------

**EXT: news feeder**

**Introduction**

- What does it do?
- Screenshots
- Extension tested on...
- Stable, unstable or beta?

**User manual**

- News approval
- News Statistics
- Manual Check
- Site/Search Engine Test
- Delete news
- Clean Database
- Show Configuration
- Load site definitions
- FAQ

**Administration**

- Installation notes
- Configuration example
- FAQ

**Configuration**

- How to define a new engine/site
- Titles excluded, accredited and refused sites
- How to define and use keywords
- Test mode
- Production mode
- CRON mode
- Notes about the images
- FAQ
- Reference

**To Do**

**Known problems**

**To-Do list**

**Changelog**


.. _Introduction:

Introduction
------------


.. _What-does-it-do:

What does it do?
^^^^^^^^^^^^^^^^

If you want to fetch news from Google, Altavista or Excite, this
extension might fit your needs. With it you can also check individual
sites (not only search engines), parse their pages and retrieve the
news you need. This extension is not an RSS system for retrieving news
from search engines; for that purpose you can use another extension
available from typo3.org.

**The advantages of ttnews\_feeder:**

The product is very flexible and useful; its main purpose is to get
fresh news from search engines or single sites (dynamic or not),
manually or through CRON. The aim is to have a simple system to
populate your TYPO3 site and offer more interesting content to your
visitors.

*With this extension you can* :

- Fetch news from search engines (Google, Excite, etc.). You can define
  several parameters: keywords to search, keywords to exclude, how many
  news items to fetch, etc.

- Fetch news (virtually) from static/dynamic sites that do not export
  their news via RSS


*Among other things you can define:*

- one or more sysfolders to store the news (each with its own
  configuration)

- one or more sysfolders to store your keywords and search parameters

- keywords to search for on the requested engine, and keywords to
  exclude

- the relation between each keyword and the desired site/engine

- a category for each keyword: with this option the news will be
  associated with the news categories and published on the site in
  different ways according to your needs

- image support: images are downloaded and stored on your server,
  resized and related to the news fetched

- titles, or parts of titles, to exclude

- undesired sites to exclude

- accredited sites: news loaded from a recognized accredited site is
  automatically published on the FE

- the run mode: test mode or production mode for each site, with CRON
  mode, MANUAL CHECK or CRON+MANUAL CHECK to cover all needs

- CRON mode, which keeps your DB clean of old internal/external news
  without any operator intervention

- a full report via email for the administrator in CRON mode

- a partial report for the person responsible for the news


.. _Screenshots:

Screenshots
^^^^^^^^^^^

**Manual check** As you can see, some records were accepted
automatically and published, while others are waiting for approval.
Photos and images are retrieved and stored on your server!

|img-1|

When you click on **Manual check**, *ttnews\_feeder* connects to
Google and the other engines or static sites previously defined and
fetches the news according to the given parameters. An icon explains
the record *status*. News Feeder checks for duplicate records and
marks their status as refused.

You can run the **test mode** and **simulate the production mode**;
this is a very convenient way to test one or more sites and switch them
to production mode when everything is OK.

**WARNING!** Don't press the button “Run manual check” twice! Once it
has been pressed, some browsers such as MSIE 7+ seem to do nothing.
Just wait for the results.

**News approval** (TYPO3 4.0.2+): three options, **suspend**,
**delete** and **approve**.

|img-2|

**Test mode** You can select the individual sites you need to test, or
invert the selection. Hidden sites will not be considered.

|img-3|

**Load sites definition** Configuring the most common search engines is
very easy: simply select what you want and click the button, and you
are ready to run. Define one or more keywords and you can fetch the
news!

|img-4|


.. _Extension-tested-on:

Extension tested on...
^^^^^^^^^^^^^^^^^^^^^^

This extension works fine and was tested successfully on TYPO3 3.8.1,
4.0, 4.0.2, 4.1.1 and 4.1.4, under PHP 4.4.x and later and PHP 5.2.x
and later.


.. _Stable-unstable-or-beta:

Stable, unstable or beta?
^^^^^^^^^^^^^^^^^^^^^^^^^

Since v. 1.0 News Feeder has been declared **stable** because it reads
the news and extracts the contents correctly (except for the last
versions 1.1.20-22, because of changes Google.it/.com made to their
HTML code). The extension works correctly (see To-Do list and Known
problems) and will be declared beta only if major problems cause great
instability. Some problems may be caused by new site definitions that
have not been loaded: at each update **do not forget to reload the site
definitions.**


.. _User-manual:

User manual
-----------


.. _News-approval:

**News approval**
^^^^^^^^^^^^^^^^^

When ttnews\_feeder is launched interactively or via CRON, it stores
the news in the DB for the sites marked 'production'. News fetched
from accredited sites is published immediately (for this to work,
configure your TSConfig properly, parameter: clearCachePages). If you
don't clear the cache, and the page cache is not cleared by other
means, your news will not be available in the FE.

News approval is very easy. Select the item **News approval** from the
top-right menu and wait.

For each news item you will see some data and the URL. If you want to
check the original page, click the URL and the page will open in a new
window. Then choose one of the radio buttons:

**suspend** keep the news suspended, no effect on status

**delete** delete the news (hidden)

**approve** the news is approved and published

Once you have decided what to do, press the **Confirm** button.


.. _News-Statistics:

**News Statistics**
^^^^^^^^^^^^^^^^^^^

Here you can see the stats for news published, deleted, to approve,
etc.


.. _Manual-Check:

**Manual Check**
^^^^^^^^^^^^^^^^

Click on Web > News Feeder and click on your FEEDER FOLDER. Before
running a Manual Check, I suggest you define one or more sites
correctly and then test them through the menu 'Site/Search Engine
Test'.

Manual Check loads the retrieved news into your database; news fetched
from accredited sites is immediately available online if you set the
cache parameters correctly (see below for the parameter
clearCachePages).


.. _Site-Search-Engine-Test:

**Site/Search Engine Test**
^^^^^^^^^^^^^^^^^^^^^^^^^^^

**This option is for admins only -** Click on Web > News Feeder and
click on your FEEDER FOLDER. You can define one or more search
engines/sites to visit and easily fetch the news required. Once you
have defined a site and marked it as 'test-site', you can check whether
it works correctly and test the criteria loaded for exclusion or
automatic approval.

Note: this option visits all sites and repeats the visit for each
associated keyword.


.. _Delete-news:

**Delete news**
^^^^^^^^^^^^^^^

This option allows you to manually delete all news that has expired
according to the preferences selected for each search engine/site
defined. The news is not really deleted: it stays in your database as a
record marked 'deleted'. This is very useful because News Feeder can
still check whether a title has already been loaded, and all criteria
keep working until the record is removed permanently.


.. _Clean-Database:

**Clean Database**
^^^^^^^^^^^^^^^^^^

Acts only on the records deleted with the previous option. These
records are permanently removed from your database after the number of
days set in your preferences: see the option removeExternalOldNews for
external news, and the option removeMyOldNews, which helps you keep
your database clean of your own internal news.

*Images note*: this option also permanently removes all images related
to the removed news.


.. _Show-Configuration:

**Show Configuration**
^^^^^^^^^^^^^^^^^^^^^^

This is a simple report for each search engine, showing if the engine
is under test, hidden and other parameters for delete and clean
options.


.. _Load-site-definitions:

Load site definitions
^^^^^^^^^^^^^^^^^^^^^

**This option is for admins only -** It allows the admin user to
*load* any of the predefined sites that are listed and checked.

**New site** If the site was not loaded before, it is added to your
database. Each newly added site is configured to run in test mode; to
run it in production mode, edit the record properties and change the
status.

**Update** If the site was previously created using **News feeder**,
the site (if checked) is updated automatically. The update modifies
only the fields containing the occurrences used to extract records
from the page, and the site URL to connect to. You must update your
site definitions again when something goes wrong (e.g. you can no
longer read news from Google.com).

*Important* – if you need to update your sites, remember that when you
run this option News Feeder does not connect to the internet to
download new definitions. You must reinstall the extension. The best
way is to download it directly from typo3.org/extensions/ and so avoid
older versions (mirrors are often not up to date).

*Warning* – the update overwrites all field values and is based on the
**creation date** of the records listed. The only way to reuse data
from a predefined site is to copy it using ONLY the BE interface; the
copy gets a new creation date, so you have a new site that will no
longer be updated. This is useful if, for example, you are **Dutch**
and want to copy the '*google.com news*' site, keeping the original and
modifying the copy (e.g. *google.nl news*). Follow these steps:

- the first time, load the '*google.com news*' site definitions

- through the BE interface, copy and paste the site

- rename the new (copied) site to *google.nl* according to your needs
  (adjust the name, URL, etc., after connecting to *google.nl* and
  doing some tests)

- edit the new (copied) site, apply your modifications and run a test

- from then on News Feeder will not touch the *google.nl* site
  definition; it will update only the google.com definitions

If you want to *collaborate*, please send me a copy of your definition
(you can save it from the BE using the right mouse button in your
window, then attach it to the email).

Latest site update

.. ### BEGIN~OF~TABLE ###


.. _news-google-com:

news.google.com
"""""""""""""""

.. container:: table-row

   Site name
         news.google.com
   
   Site type
         Search engine
   
   Review date
         NOT SUPPORTED(1)


.. _news-google-it:

news.google.it
""""""""""""""

.. container:: table-row

   Site name
         news.google.it
   
   Site type
         Search engine
   
   Review date
         NOT SUPPORTED(1)


.. _yahoo-com-news-english:

yahoo.com news (english)
""""""""""""""""""""""""

.. container:: table-row

   Site name
         yahoo.com news (english)
   
   Site type
         Search engine
   
   Review date
         Dec 2011


.. _yahoo-it-news-italian:

yahoo.it news (italian)
"""""""""""""""""""""""

.. container:: table-row

   Site name
         yahoo.it news (italian)
   
   Site type
         Search engine
   
   Review date
         Dec 2011


.. _yahoo-it-news-german:

yahoo.it news (german)
""""""""""""""""""""""

.. container:: table-row

   Site name
         yahoo.it news (german)
   
   Site type
         Search engine
   
   Review date
         Dec 2011


.. _it-bing-com-talian:

it.bing.com/ (italian)
""""""""""""""""""""""

.. container:: table-row

   Site name
         it.bing.com/ (italian)
   
   Site type
         Search engine
   
   Review date
         Dec 2011 (2)


.. _www-bing-com-deutsch:

www.bing.com (deutsch)
""""""""""""""""""""""

.. container:: table-row

   Site name
         www.bing.com (deutsch)
   
   Site type
         Search engine
   
   Review date
         Sept 2012 (2)


.. ###### END~OF~TABLE ######

(1) Since Dec 1, 2011 the news published via Google is displayed in
the browser page using JavaScript, so it is not possible to fetch it.
Within 1-2 months a new extension will be available to read the news
using POP3 and store the fresh news in the DB.

(2) Bing detects the location of your server and returns news in the
language of that location. 'Deutsch' means that the language of the
news will be German for connections to Bing from Germany.

**HINT** : re-edit your keywords and relate them to new search engines
to ensure fresh news for your site (you can use yahoo, bing, etc.)

**About yahoo.it/.com** - this engine shows images that cannot be
fetched by News Feeder because the published images are not related to
any news item.


.. _FAQ:

FAQ
^^^

None


.. _Administration:

Administration
--------------


.. _Installation-notes:

Installation notes
^^^^^^^^^^^^^^^^^^

This extension is reserved for administrators only. However, if you
limit access to your folders (this will be explained in detail in the
future) you will be able to delegate news approval, deletion and other
tasks to one or more BE users.

This manual is under development, so to run the extension correctly I
suggest following the configuration instructions step by step; see the
next chapter.

*Legal issue: somewhere in your site please cite the sites/engine
visited* .

**It is recommended to read these steps carefully, otherwise it will be
very difficult to run the extension correctly!**

1. Install the extension as an admin BE user.

2. Once it is installed, clear the /typo3conf cache.

3. Confirm the requested DB modifications. The extension requires the
   **tt\_news** extension to be installed and adds a new field to the
   tt\_news table: this is needed to record which site each news item
   was fetched from.

4. Create a *sysfolder* (e.g. name it NEWS\_FEEDER) to store your
   configuration parameters and take note of its PID number:

   a. within your site you can create one or more folders; the
      suggestion is to create one folder.

   b. edit the **page properties of your FOLDER or of another page in
      your root-line (above your page...)** and insert the following
      configuration lines in the TSconfig field; you can simply
      copy/paste them:


.. _Configuration-example:

Configuration example
^^^^^^^^^^^^^^^^^^^^^

(copy and paste, then change references...):

::

   mod.web_txttnewsfeederM1 {
     clearCachePages = 1,364,365,366,367,369,370,378,383
     useRandomTime = 1
     fetchImages = 1
     resizeImages = 1
     resizedJpgCompression = 60
     resizedImagePxWidth = 80
     maxImageByteSize = 20000
     maxImagePxWidth = 240
     maxImagePxHeight = 240
     useSubIfTitleIsEmpty = 1
     useTitleIfSubIsEmpty = 1
     backDays = 7
     suspendFlag = 0
     autosuspendLimit = 100
     maxRecordsPerSession = 30
     feederSysFolderPID = 353
     newsSysFolderPID = 360
     removeExternalOldNews = 20
     removeMyOldNews = 360
     debugFeed = 0
     charSet = cp1252
     cronWriteOnlyAccredited = 1
   }

**Images**

TEST/MANUAL CHECK - If you need to download **images**, you must
define the following parameters (as above):

::

     fetchImages = 1
     resizeImages = 1

CRON MODE - If you need to download **images**, remember that the
pictures are written to uploads/pics. News Feeder automatically
assigns them to the owner/group of the folder uploads/pics. However,
you can force another owner/group by adding these two parameters to the
configuration above:

::

     apacheOwner = www-data
     apacheGroup = www-data

Finally you can set the compression quality for  *Jpg* file formats
and the limits (see reference).

**Clear the cache**

In CRON mode, when the feeder has finished, you need to clear the
cache for some pages, using this parameter:

::

   mod.web_txttnewsfeederM1.clearCachePages = all
   

Depending on your needs you can also use 'pages' or 'temp\_CACHED'
(see the TYPO3 API reference).


.. _FAQ:

FAQ
^^^

\- none


.. _Configuration:

Configuration
-------------

**Before running this extension, read these steps carefully and
configure it (see Administration); otherwise it will be very difficult
to run the extension correctly!**


.. _How-to-define-a-new-engine-site:

**How to define a new engine/site**
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Define your first engine**

Within your sysfolder assigned to the FEEDER create your first site.
The following example concerns the configuration parameters for the
engine:

`http://www.google.it <http://www.google.it/>`_

As stated before, this manual is reserved for administrators only (see
User manual). The best way to get this extension working is therefore
to follow the instructions below step by step. In the future, new
documentation will be published to explain further topics (how to
configure a new site, how to study its HTML, etc.).

If you are *admin* you can load a predefined engine or define a new
one. To start as quickly as possible, run News Feeder and select the
last option from the drop-down menu: '*Load sites definition*'. This
option creates a new engine for you; the definitions are stored in a
file shipped with this extension. News Feeder will check and create a
new engine for you:

::

   Google (test mode) news.google.it

This engine setup works fine and has been tested for a long time. The
tag definitions inside it are for the Google news engine in ITALIAN
(http://www.google.it); google.com news was tested on Jan 04, 2007 and
works fine. google.com had recently changed the HTML output for its
news, but since Jan 04, 2007 everything is OK again and I can connect
and read the pages: **contact me only** if the preloaded site
definitions do not work correctly.

Now open your FEEDER folder (from the BE interface: List -> select your
folder) and you will see what happened. Edit the 'Google (test mode)
news.google.it' record and you will see the page with the parameters
needed to fetch the news.

*Warning*: this extension works using GET vars, the PHP file function
to fetch the pages and the PHP eregi function to accept or exclude
sites/titles. If you don't know how they work, please refer to
`http://www.php.net <http://www.php.net/>`_ . The extension does not
use navigators (it might in the future) and is therefore *unable* to
send POST data.

*Brief explanation of used fields* :

**Hide** if engine is hidden it will not be processed by
*ttnews\_feeder*

**Search engine name** site/engine name

**Scheme** default: http://, alternative: https://. Tip: for testing,
save the remote page (using Mozilla, Explorer, etc.) on your hard disk
and transfer it to your server. This avoids stressing the remote server
during tests.

**Url** URL for the connection. Here you can use some markers:

- ###RECORDSTOVIEW### how many records to retrieve (e.g. 10, 20, 50,
  100); the content is defined in the keywords table

- ###SEARCHKW### is replaced with the search keywords; the content is
  defined in the keywords table

- ###EXCLUDEKW### is replaced with the keywords to exclude; the content
  is defined in the keywords table
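The marker substitution described above can be pictured with a small
Python sketch. This is only an illustration, not the extension's own
(PHP) code: the function name ``build_search_url`` and the
``www.example.com`` URL template are assumptions made for the example.

```python
def build_search_url(scheme, url_template, records, search_kw, exclude_kw):
    """Fill the three URL markers and prepend the scheme.

    The keyword values are inserted verbatim, because the keywords table
    is expected to hold them already in the engine's own GET syntax
    (for example "bush" and "+-powell").
    """
    url = url_template
    url = url.replace("###RECORDSTOVIEW###", str(records))
    url = url.replace("###SEARCHKW###", search_kw)
    url = url.replace("###EXCLUDEKW###", exclude_kw)
    return scheme + url


# Hypothetical site definition: the host and parameter names are made up.
template = "www.example.com/search?n=###RECORDSTOVIEW###&q=###SEARCHKW######EXCLUDEKW###"
example_url = build_search_url("http://", template, 20, "bush", "+-powell")
```

With these example values the search and exclusion keywords end up side
by side in the query string, which is why the exclusion field carries
its own leading separator.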

**Charset** You can select one of the listed items. All strings
(title, subtitle, font) will be converted to this charset. If you
don't know what to choose, try cp1252. If you see undesired characters,
change this parameter until the problem disappears.

**Content unwrap** a start tag (or part of a tag) and an end tag (or
part of a tag) that tell the extension what to fetch from the page.
Content means the whole block of the page containing *all* the news.

**Section unwrap** a start tag (or part of a tag) and an end tag (or
part of a tag) that tell the extension what to fetch from the Content
(above) to extract each news item (title, subtitle, font, etc.).

**Title unwrap** a start tag (or part of a tag) and an end tag (or part
of a tag) that tell the extension what to fetch from the Section
(above) to extract the title.

**Subtitle, Font and Link unwrap** As above.

**Subtitle extraction method** If the title of the news and its
subtitle are on the same page, select '**from search page (url
above)**': the URL field is used to fetch the subtitle, i.e. it is
taken from the same page. Otherwise select '**from target page, news
link**'. This second option can slow down extraction, because News
Feeder loads another page to examine and fetch the subtitle; which page
depends on the extracted link (see **Link unwrap** below). If the text
is long it is truncated to the first 255 characters, preserving the
last word found (this is not a crude crop!).
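As a rough illustration of such a word-preserving crop, here is a
Python sketch. It assumes that "preserving the last word" means never
cutting in the middle of a word; the function name ``crop_subtitle``
is hypothetical and the extension's real (PHP) routine may behave
differently in edge cases.

```python
def crop_subtitle(text, limit=255):
    """Crop text to at most `limit` characters without splitting a word."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the cut landed inside a word, drop the partial word at the end.
    if text[limit] != " " and " " in cut:
        cut = cut[:cut.rfind(" ")]
    return cut.rstrip()
```

A text without any space inside the limit is simply cut at the limit,
since there is no word boundary to fall back to.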

**image unwrap, if any found in the section** If the section extracted
(captured with *Section unwrap*) contains an image and you set the
parameter fetchImages = 1 (bool), News Feeder downloads the images
whose type is recognized by the TYPO3 configuration parameters defined
during the installation process. The images are stored in the
/uploads/pics/ folder of your site. Images larger than the
maxImageByteSize parameter are not written and are thus ignored.

All tags for extraction are separated by the marker ###SEP###; use this
marker and the URL markers to design a new engine/site. If you need to
define a new site, study the page carefully and define these unwraps
correctly, then use TEST MODE to check that the site works and finally
switch the site to production mode (MANUAL CHECK or CRON MODE).
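To show how the nested unwraps fit together, here is a minimal Python
sketch. It assumes that each unwrap field has the form
``start###SEP###end``; the function name ``unwrap`` and the sample HTML
are invented for the example, and the extension itself is written in
PHP and may match tags differently.

```python
SEP = "###SEP###"


def unwrap(html, definition):
    """Return every fragment found between the start and end delimiters."""
    start, end = definition.split(SEP, 1)
    if not start or not end:
        raise ValueError("both delimiters must be non-empty")
    fragments = []
    pos = 0
    while True:
        begin = html.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        finish = html.find(end, begin)
        if finish == -1:
            break
        fragments.append(html[begin:finish])
        pos = finish + len(end)
    return fragments


# Invented page: Content -> Section -> Title, as in the field list above.
page = '<div id="news"><p><b>First</b> one</p><p><b>Second</b> two</p></div>'
content = unwrap(page, '<div id="news">###SEP###</div>')[0]
sections = unwrap(content, "<p>###SEP###</p>")
titles = [unwrap(section, "<b>###SEP###</b>")[0] for section in sections]
```

Each level works only on the text captured by the level above it, which
is why a wrong Content unwrap makes every later unwrap fail as well.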

**Link unwrap** This is used to fetch the link that points to the site
where the entire news is published (see also subtitle extraction
method).

**Url to add to the extracted link** Sometimes a site (especially a
*static* one) points to its internal news using only relative
references (e.g. /index.php?id=28). If such a site is indexed by
*ttnews\_feeder*, the relative path cannot be published on your TYPO3
site, so the extension adds this URL to rebuild the full (*absolute*)
path. *Note*: if you are configuring a static/dynamic site and the
image unwrap is set, this URL is also used to fetch the images. When
News Feeder analyzes a URL it checks whether it starts with 'http://'
or 'https://' (absolute paths); if not, it builds the full URL by
prepending this parameter to what was fetched.
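The rule just described is small enough to sketch in Python. The
function name ``absolute_link`` and the example host are assumptions;
the sketch only mirrors the behaviour stated above (check the prefix,
otherwise prepend the configured URL).

```python
def absolute_link(link, url_to_add):
    """Return the link unchanged if it is absolute, else prepend the base."""
    if link.startswith(("http://", "https://")):
        return link
    return url_to_add + link
```

Because the two parts are simply concatenated, the configured URL
should not end with a slash when the site's relative links already
start with one.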

**Autoclean** (interactive or CRON mode) – If enabled, expired records
are deleted (not removed!) after the number of days defined in the next
box:

**Autoclean backdays** All news related to this site is marked as
deleted after the number of days defined here. Deleted news is still
present in the database and is used for title/URL exclusion, but is not
available to visitors.

**Mode** Running mode. At the first time please select **Test mode** .

**Check every n days** Check frequency under Cron/Manual check mode:
'0' means each day, otherwise write the number of days between one
check and the next. *Note* : if you leave this field empty News Feeder
will use 0.

**Notes** Internal notes. When you proceed with an UPDATE this field
will be preserved and News Feeder will add the UPDATE date and hour.


.. _Titles-excluded-accredited-and-refused-sites:

Titles excluded, accredited and refused sites
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These tables are used to exclude titles and to define accredited or
refused sites; their use is intuitive and easy. An excluded title needs
the URL related to that title to be specified; you can use REGEXP. As
stated before, please refer to the PHP site for **REGEXP** syntax.


.. _How-define-and-use-keywords:

How define and use keywords
^^^^^^^^^^^^^^^^^^^^^^^^^^^

*Define your keywords* - Within your system folder assigned to the
FEEDER create your keywords. The following example concerns the
configuration parameters for the keywords. Here you can define several
keywords and configure them individually to obtain different results.
Each keyword can be related to one or more sites:

**Hide** if keyword is hidden it will not be processed by
*ttnews\_feeder*

**keyword** search keyword: you must use the syntax of the search
engine you connect to, e.g. for Google you can fill this field with:
antivirus+security (use '+' as separator)

**but not...** keyword (or list of keywords) to exclude; for Google the
typical form is: +-microsoft+-HIV+-flu

**search engines** select from the right-hand box the search engines
you want to query with the keyword. Note that Google, Yahoo and Excite
use the same keyword syntax. For sites that use a different syntax for
keyword definition and exclusion you must create a new keyword.

**Category** here you can select one or more categories to relate to
the news extracted and approved. This is very useful if you need to
aggregate news on your site using the tt\_news plugin. Refer to the
tt\_news documentation to learn how to create categories.

**Notes** internal notes. Put here what you want and remember.

**I suggest** *you first define one or more search engines* and then
define the keywords. You can relate each keyword to one or more search
engines, but each keyword must respect the syntax of the selected
search engine(s): Google, Altavista and Excite use the same syntax. If
the syntax is different, you must define another keyword for the
desired search engine.

**How to define a keyword correctly –** To avoid errors, please follow
the steps below:

- using your preferred browser, connect to the desired engine (e.g.
  *http://news.google.it*)

- fill in the search box and run a search, e.g. with the keywords
  *bush -powell* (which means: search for *bush* news but avoid content
  about '*powell*')

- click on the search button

- note that the URL box has changed; for the example above you will see:
  http://news.google.it/news?hl=it&ned=it&q=bush+-powell&btnG=Cerca+nelle+notizie

- you can now see how Google passes the GET vars: the search terms
  appear as **bush+-powell**

- fill the field 'keyword' (see the previous paragraph *Define your
  keywords*) with: bush

- fill the field 'but not...' (see the previous paragraph *Define your
  keywords*) with: +-powell

- finally, associate your keyword with the search engine and run a test

- when everything is OK, change your search engine properties, switching
  to *production mode*
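The steps above can also be sketched in Python: given the URL copied
from the browser, split the raw search parameter into the values for
the two fields. The function name ``fields_from_search_url`` and the
default parameter name ``q`` are assumptions for this example; other
engines may use a different parameter or syntax.

```python
from urllib.parse import urlsplit


def fields_from_search_url(url, param="q"):
    """Split the raw value of `param` into (keyword, but_not).

    The value is deliberately not URL-decoded, because the fields are
    expected to keep the engine's own '+' separators.
    """
    raw = ""
    for pair in urlsplit(url).query.split("&"):
        name, _, value = pair.partition("=")
        if name == param:
            raw = value
            break
    terms = raw.split("+")
    keyword = "+".join(t for t in terms if t and not t.startswith("-"))
    but_not = "".join("+" + t for t in terms if t.startswith("-"))
    return keyword, but_not


example = fields_from_search_url(
    "http://news.google.it/news?hl=it&ned=it&q=bush+-powell&btnG=Cerca+nelle+notizie"
)
```

For the URL from the steps above this yields the same two values that
the manual tells you to enter by hand.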


.. _Test-mode:

**Test mode**
^^^^^^^^^^^^^

Once you have configured the extension and defined a keyword and a
search engine, you can run a test.

Test mode doesn't write any record to your DB and is a great way to
check that your *engine-configuration* is working well. To run test
mode click on:

|img-5|

and then in the right-frame select the menu item:

::

   Test news engine/sites

read the text, select the name of the site to test (or All) and click
on the button:

::

   Run site/engine test

*Note*: if you see nothing, you have probably not defined a site/engine
or keyword yet. Test mode is very similar to production mode, except
that the modifications are not written. Test mode still checks the DB
for records already stored. Each record displayed has an icon on the
left; on the right there is a brief explanation (this is called the
'*news status*'). Images are not written to your server, only displayed
through a link to the remote site.


.. _Production-mode:

**Production mode**
^^^^^^^^^^^^^^^^^^^

Once you have configured the extension and tested the site/engine as
explained, you can switch the site/engine status to production mode
(refer to the engine configuration to do it).

When a site is in production mode, records are written to the DB. To
run it, follow these instructions:

Click on

|img-5|

and then in the right-frame select the menu item:

::

   Run Manual Check

read the text and click on the button:

::

   Run Manual Check

Please wait a few seconds for it to finish and read what was fetched.

*Note*: if you have deleted a record (manually, or because it was
automatically refused), the record is only hidden and is still stored
in the DB. It is removed permanently only by the menu item Clean DB.
Keep in mind that if you remove records *permanently*, using
*ttnews\_feeder* or other utilities, the extension can no longer check
whether a given news item is already stored, so the next manual (or
CRON) check will load it again as fresh news.

Images, if any, are written to your server in the folder
*uploads/pics*, according to the given parameters; images larger than
maxImageByteSize are skipped.


.. _CRONmode:

**CRON mode**
^^^^^^^^^^^^^

**Since v. 3.0.1 you must remove all existing CRONTAB entries and set
up CRON as follows.**

First add to your site a new BE user with the name:

::

   _cli_ttnewsfeeder

Set the parameter  **newsBEOwner** (see reference):

::

   mod.web_txttnewsfeederM1.newsBEOwner = <uid>

If you want to edit/display the fetched news, remember to set the uid
above to '1' (usually the uid of the Admin user); otherwise use the uid
of another BE user or, if you prefer, the uid of the user:

::

   _cli_ttnewsfeeder

it's your own choice depending on security issues and privileges
assigned to various BE users.

**Since ttnews\_feeder v. 3.0.1** I suggest you install and configure
the system extension SCHEDULER, then configure a new task to fetch the
news using ttnews\_feeder with the **desired interval.**

Finally set up the crontab by adding a line like this:

::

   5  *  *  *  *  php -q /var/www/www.example.com/web/typo3/cli_dispatch.phpsh scheduler
   
   
Please adjust the path /var/www... to your site and refer to the
dispatcher configuration, which is part of the core.

Under some circumstances you will need to change the access permissions
of ttnews\_feeder\_cli.phpsh:

::

   chmod 0755 <path-to-your-site>typo3conf/ext/ttnews_feeder/Classes/Cli/ttnews_feeder_cli.phpsh

**Warning!** News Feeder behaves the same way under CRON as in the BE,
so I suggest you try it in the BE first. Under CRON, News Feeder
fetches news and, for accredited sites, publishes it immediately! This
is a good way to automate your site, but it carries some risk, so
choose carefully which sites you define as 'accredited'. News coming
from non-accredited sites is stored in your database and you must
approve it manually. **Don't forget that you must define at least a
keyword and/or an engine and select the MODE: CRON MODE** *or* **CRON
MODE+MANUAL MODE**

**Suspend CRON mode** You can *suspend* CRON (e.g. when you are on
vacation...) by setting

::

   suspendFlag = 1

Set the parameter autoSuspendLimit = <value> to a suitable value: when
CRON detects that the number of news items awaiting approval exceeds
the limit, it will not fetch and store news.
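The two suspend controls can be summarized by a short Python sketch of
the decision they imply. This is an illustration only: the function
name ``cron_may_fetch`` is hypothetical, and treating a limit of 0 as
"no limit" is an assumption that the manual does not confirm.

```python
def cron_may_fetch(suspend_flag, autosuspend_limit, pending_news):
    """Return True if a CRON run should fetch and store news."""
    if suspend_flag:
        return False  # CRON fetching suspended by hand
    # Assumption: a limit of 0 (or unset) disables the automatic suspend.
    if autosuspend_limit and pending_news > autosuspend_limit:
        return False  # too many news items are still awaiting approval
    return True
```

With the values from the configuration example (suspend flag 0, limit
100), a run with 30 pending items would fetch, while one with 101
pending items would not.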

**How to receive a report via email** If you are the admin, set up
CRON as above and add the following to the end of the line:

::

   (...) ttnews_feeder_cli.phpsh | admin@your-domain.com

If you are the admin and several people are responsible for news
approval, each for a different section, you will receive the same
report you see in interactive mode (BE) for all activated sections.
Otherwise, if you want each person responsible for a certain section
to receive an email with a report, configure the parameter
newsResponsibleEmail in the modTSConfig (see Reference).

At each CRON run, the responsible person will receive an email with
their own report.
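
For example, a section responsible could be configured like this. This
is only a sketch: the mod.web\_txttnewsfeederM1 prefix is assumed from
the changelog entry for v.2.1.1, and the address is a placeholder:

::

   # Placeholder address; receives the report at each CRON run
   mod.web_txttnewsfeederM1.newsResponsibleEmail = editor@your-domain.com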

**Store only the accredited site records**

If you set cronWriteOnlyAccredited to '1' and the CRON TASK is active,
News Feeder will store in the db only the records coming from
accredited sites. This can be very useful if you need to automate the
approval process completely, avoiding manual approval. Valid records
that would normally be held for manual approval are stored in the db
and marked as deleted, so that News Feeder can recognize them and
reject them again on the next check.

**Cron keeps your DB clean!** Even if you set suspendFlag to 1, as long
as the CRON TASK is active News Feeder will still be launched and will
keep your db clean, checking for records to delete and erasing them.
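
A minimal TSconfig sketch of this setting, again assuming the
mod.web\_txttnewsfeederM1 prefix shown in the changelog entry for
v.2.1.1:

::

   # Under CRON, write only records from accredited sites to the db
   mod.web_txttnewsfeederM1.cronWriteOnlyAccredited = 1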


.. _Notes-about-the-images:

Notes about the images
^^^^^^^^^^^^^^^^^^^^^^

Image download is available only if you set the fetchImages parameter
to true (1). In addition, if you want downloaded images to be resized
to a certain width (e.g. 100 px), you must also set the resizeImages
parameter.

If you set resizeImages to true (1), the images will first be resized
and only  **after** resizing will they be measured and accepted
according to the maxImageByteSize, maxImagePxWidth and
maxImagePxHeight parameters.

**Check for extensions allowed –** News Feeder first accepts only the
image extensions allowed by the TYPO3 general configuration. Note that
the autoresize option is available only for the JPEG, JPG, GIF and PNG
image formats. If autoresize is on and an image is not in one of these
formats, it will be kept unresized and measured as described above
and, if it is oversized, it will be refused.

**Autoresize images –** I suggest keeping it on: you save disk space on
your server and you will have more images for your news because images
will rarely be refused.

**Images quality –** The first release with image support (v 1.1.16)
was not tested with the PNG format and could be improved. Please
contact me if images are not displayed as expected so that I can
introduce new code for resizing.

**Images and tt\_news –** If you have News Feeder resize images,
please note that all images will be resized again by the tt\_news
extension to create thumbnails in news listings and elsewhere. The
best way to avoid low quality is to set the tt\_news parameters (max
image width and max image height) to values greater than or equal to
resizedImagePxWidth. The  *height* is calculated automatically by
News Feeder.
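
The image settings described above could be combined as in the
following sketch. The mod.web\_txttnewsfeederM1 prefix is assumed from
the changelog entry for v.2.1.1, the property names are those listed in
the Reference, and the width of 100 px is the illustrative value used
above; the remaining values are the defaults given in the Reference:

::

   # Enable image download and autoresize
   mod.web_txttnewsfeederM1.fetchImages = 1
   mod.web_txttnewsfeederM1.resizeImages = 1
   # Illustrative target width; the height is calculated automatically
   mod.web_txttnewsfeederM1.resizedImagePxWidth = 100
   # Limits checked after resizing (Reference defaults)
   mod.web_txttnewsfeederM1.maxImageByteSize = 15000
   mod.web_txttnewsfeederM1.maxImagePxWidth = 300
   mod.web_txttnewsfeederM1.maxImagePxHeight = 300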


.. _FAQ:

FAQ
^^^

**Why can't I see anything under test mode?** Check that your
configuration is OK (header unwrap etc.), then verify your site. A
common error with engines is forgetting that they must be related to a
keyword definition. If you have not loaded a keyword related to your
(new) site, your site will not be visited.

**I have just loaded a new definition and run a manual test, but I
can't see anything. Why?** You can define several sites/engines, but to
run them you must create at least one keyword and associate it (relate)
to your engine. So, if you have just loaded a new engine (e.g.
*Google*), please load a new keyword and select the engine from the
menu.

**Parsing 'news.google.it' sometimes a subtitle disappears. Why?** The
extension extracts the text using the 'unwrap' parameters passed
through the search engine definition. Some  *google* records are
formatted differently and the extension cannot extract them correctly.
However, the title is always available.

**I'm Italian and I have loaded the news.google.COM site definition.
Nothing works, why?** The extension connects to news.google.com but
Google redirects to the Italian service: news.google.it. The pages are
formatted differently and the extension cannot fetch records if the
site is redirected.


.. _Reference:

Reference
^^^^^^^^^

The most important settings for a correct setup:

- Define the pid of the  *ttnews\_feeder* system folder

- Define the uid of the (user): news owner
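
These two points could look like this in TSconfig. This is a sketch
only: the mod.web\_txttnewsfeederM1 prefix is assumed from the changelog
entry for v.2.1.1, and the ids 123, 124 and 1 are hypothetical values
that must be replaced with the ids of your own installation:

::

   # Hypothetical pid of the ttnews_feeder system folder
   mod.web_txttnewsfeederM1.feederSysFolderPID = 123
   # Hypothetical pid of the folder that stores the external news
   mod.web_txttnewsfeederM1.newsSysFolderPID = 124
   # Hypothetical uid of the BE user who owns the news
   mod.web_txttnewsfeederM1.newsBEOwner = 1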

\- Reference (TSconfig): ttnews\_feeder –  **News Feeder**

.. ### BEGIN~OF~TABLE ###


.. _clearCachePages:

clearCachePages
"""""""""""""""

.. container:: table-row

   Property
         clearCachePages
   
   Data type
         int+/string
   
   Description
         List of the pids of all pages whose cache must be cleared. Clearing
         runs at the end of the process so that the fresh news of accredited
         sites is immediately available on BE (since v.2.1.1 you can also use:
         pages, all, temp\_CACHED)
   
   Default
         -


.. _useSubIfTitleIsEmpty:

useSubIfTitleIsEmpty
""""""""""""""""""""

.. container:: table-row

   Property
         useSubIfTitleIsEmpty
   
   Data type
         boolean
   
   Description
         1 (true), 0 (false) – If set to 1 and the news field Title cannot be
         extracted (for whatever reason), it will be replaced by the subtitle,
         limited to 60 chars
   
   Default
         1


.. _useTitleIfSubIsEmpty:

useTitleIfSubIsEmpty
""""""""""""""""""""

.. container:: table-row

   Property
         useTitleIfSubIsEmpty
   
   Data type
         boolean
   
   Description
         1 (true), 0 (false) – If set to 1 and the news field Subtitle cannot
         be extracted (for whatever reason), it will be replaced by the Title,
         limited to 250 chars
   
   Default
         1


.. _BackDays:

BackDays
""""""""

.. container:: table-row

   Property
         BackDays
   
   Data type
         int+
   
   Description
         Under evaluation; currently not used
   
   Default
         7


.. _suspendFlag:

suspendFlag
"""""""""""

.. container:: table-row

   Property
         suspendFlag
   
   Data type
         boolean
   
   Description
         Set to '1' if you are on vacation: this will suspend any fetching
         through CRON
   
   Default
         0


.. _autosuspendLimit:

autosuspendLimit
""""""""""""""""

.. container:: table-row

   Property
         autosuspendLimit
   
   Data type
         int+
   
   Description
         Works only in CRON mode. If this limit is reached (e.g. there is no
         operator to approve fresh news because of a vacation) no more news
         is accepted and stored in the DB. The counter keeps track only of
         approved news. This prevents DB overload.
   
   Default
         100


.. _maxRecordsPerSession:

maxRecordsPerSession
""""""""""""""""""""

.. container:: table-row

   Property
         maxRecordsPerSession
   
   Data type
         int+
   
   Description
         Works only in MANUAL CHECK mode. If this limit is reached no more news
         is accepted and stored in the DB. The counter keeps track only of
         approved news.
   
   Default
         30


.. _feederSysFolderPID:

feederSysFolderPID
""""""""""""""""""

.. container:: table-row

   Property
         feederSysFolderPID
   
   Data type
         int+
   
   Description
         The PID of the page where your configuration tables are stored
         (keywords, sites/engines to visit, etc.).
   
   Default
         required


.. _newsSysFolderPID:

newsSysFolderPID
""""""""""""""""

.. container:: table-row

   Property
         newsSysFolderPID
   
   Data type
         int+
   
   Description
         The PID of the page where your EXTERNAL NEWS is stored. I suggest
         keeping your internal and external news separate so that it is easier
         for you to inspect them.
   
   Default
         ul


.. _newsBEOwner:

newsBEOwner
"""""""""""

.. container:: table-row

   Property
         newsBEOwner
   
   Data type
         int+
   
   Description
         Use this parameter only if you want every record written to the
         tt\_news table to carry the same user id; otherwise the UID of the BE
         user running News Feeder will be used.
   
   Default
         1


.. _removeExternalOldNews:

removeExternalOldNews
"""""""""""""""""""""

.. container:: table-row

   Property
         removeExternalOldNews
   
   Data type
         int+
   
   Description
         Days back - When this limit is reached: CRON (if used) will remove
         expired news; if you work in MANUAL CHECK, the news will be removed
         manually
   
   Default
         50


.. _removeMyOldNews:

removeMyOldNews
"""""""""""""""

.. container:: table-row

   Property
         removeMyOldNews
   
   Data type
         string
   
   Description
         Days back - When this limit is reached: CRON (if used) will remove
         expired news; if you work in MANUAL CHECK, the news will be removed
         manually.
   
   Default
         920


.. _charSet:

charSet
"""""""

.. container:: table-row

   Property
         charSet
   
   Data type
         String
   
   Description
         Charset for HTML conversion; accepts the same values as the PHP
         htmlentities function
   
   Default
         cp1252


.. _maxImageByteSize:

maxImageByteSize
""""""""""""""""

.. container:: table-row

   Property
         maxImageByteSize
   
   Data type
         int+
   
   Description
         Maximum size in bytes of fetched images
   
   Default
         15000


.. _fetchImages:

fetchImages
"""""""""""

.. container:: table-row

   Property
         fetchImages
   
   Data type
         bool
   
   Description
         Whether to fetch images from the site/engine; default: disabled
   
   Default
         0


.. _maxImagePxWidth:

maxImagePxWidth
"""""""""""""""

.. container:: table-row

   Property
         maxImagePxWidth
   
   Data type
         int+
   
   Description
         If the width of the captured image exceeds this limit, the image will
         be refused
   
   Default
         300


.. _maxImagePxHeight:

maxImagePxHeight
""""""""""""""""

.. container:: table-row

   Property
         maxImagePxHeight
   
   Data type
         int+
   
   Description
         If the height of the captured image exceeds this limit, the image will
         be refused
   
   Default
         300


.. _resizeImages:

resizeImages
""""""""""""

.. container:: table-row

   Property
         resizeImages
   
   Data type
         bool
   
   Description
         Autoresize for downloaded images; if set, all images will be
         resized according to the resizedImagePxWidth parameter
   
   Default
         0


.. _resizedImagePxWidth:

resizedImagePxWidth
"""""""""""""""""""

.. container:: table-row

   Property
         resizedImagePxWidth
   
   Data type
         int+
   
   Description
         This works only if fetchImages and resizeImages are both set to 1
         (true). Images are resized to this width whether they are narrower or
         wider than the parameter; e.g. with the default of 80, a downloaded
         image 120 pixels wide becomes 80 pixels wide, and one 60 pixels wide
         also becomes 80 pixels wide.
   
   Default
         80


.. _resizedJpgCompression:

resizedJpgCompression
"""""""""""""""""""""

.. container:: table-row

   Property
         resizedJpgCompression
   
   Data type
         int+
   
   Description
         Compression for output image if extension is JPG or JPEG; use 100 for
         no compression.
   
   Default
         70


.. _useRandomTime:

useRandomTime
"""""""""""""

.. container:: table-row

   Property
         useRandomTime
   
   Data type
         bool
   
   Description
         Whether the date and hour of fetched news are set randomly. You can
         disable this by setting it to '0'; this can be useful to keep news in
         the order of importance given by the search engine visited
   
   Default
         1


.. _newsResponsibleEmail:

newsResponsibleEmail
""""""""""""""""""""

.. container:: table-row

   Property
         newsResponsibleEmail
   
   Data type
         String
   
   Description
         Type a valid email address. Each time CRON is executed, an email
         containing a report will be sent to this address.
   
   Default
         -


.. _cronWriteOnlyAccredited:

cronWriteOnlyAccredited
"""""""""""""""""""""""

.. container:: table-row

   Property
         cronWriteOnlyAccredited
   
   Data type
         Bool
   
   Description
         If set to '1' and News Feeder is running under CRON, only the records
         of accredited sites will be written to the db.
   
   Default


.. _apacheOwner:

apacheOwner
"""""""""""

.. container:: table-row

   Property
         apacheOwner
   
   Data type
         String
   
   Description
         CRON mode: downloaded images will be set with this owner. Default:
         owner of uploads/pics.
   
   Default
         Same as uploads/pics


.. _apacheGroup:

apacheGroup
"""""""""""

.. container:: table-row

   Property
         apacheGroup
   
   Data type
         String
   
   Description
         CRON mode: downloaded images will be set with this group. Default:
         group of uploads/pics.
   
   Default
         Same as uploads/pics


.. ###### END~OF~TABLE ######

[tsref:(cObject).web\_txttnewsfeederM1]


.. _To-Do:

To Do
-----

- **a new +ext to read Google news via POP3 (within February 2012)**

- improve settings (site defs) and add some new engines

- integrate with scheduler.


.. _Known-problems:

Known problems
--------------

- **Since Dec 01, 2011** ttnews\_feeder cannot fetch google.it/.com/.de
  news because Google publishes the news in your browser *exclusively*
  using JavaScript, so the news is not readable in the page code. Within
  1-2 months a new +ext will be issued to read Google records via POP3.

- When running the feeder via CRON, if you have created two or more
  (different) BE FOLDERS the fetched news is stored improperly.
  **Please avoid using more than one folder**; this will be fixed soon.

- Running the feeder from the BE using the  **SCHEDULER** (manually), you
  should see the fetched records on the screen. SCHEDULER mode still
  needs adjustment and is currently not perfect. Moreover, I tried to
  add the code for the SCHEDULER but the scheduler refused to be
  configured and I got this error:

  *PHP Fatal error: Class 'tx\_ttnews\_feeder\_schedule' not found in
  /var/www/typo3\_src-4.4.5/t3lib/class.t3lib\_div.php on line 5260*

  This is under evaluation;  **instead**, use the manual configuration
  to run the feeder from CRON.

- If you run the extension with a WEB ACCELERATOR, please disable it
  because the image sizes will not be calculated correctly. PHP doesn't
  use the WEB ACCELERATOR and the images fetched are the same as on the
  remote site.

- Check your PHP memory limit – News Feeder was tested on a server with
  the value set to 72 MB with image fetching enabled, and the extension
  ran very slowly. A value of 96 MB should be enough to work correctly.
  If you do not have access to the server configuration (e.g. a hosting
  plan limited to 64 MB or less), consider disabling the download of
  images to reduce time and resource consumption. Please inform me if
  you face problems:  *at(at)uniud.it*


.. _To-Do-list:

To-Do list
----------

**SOME things to-do:**

- check for bugs under T3 6.2.X; I have not tried to download and reuse
  images

- test more extensively for base64 decode of the image tag

- A new menu for the BE with some infos/log about CRON mode.

- improve output messages and log for updating process

- keywords for static/dynamic sites....

- for each keyword enable or disable image fetching...

- documentation in Italian

- static/dynamic sites: add code to fetch full news and import in DB
  (long text, news type= internal)

- static/dynamic sites (not engines!) add a field to exclude undesired
  keywords

- static/dynamic sites (not engines!) add a field to relate news fetched
  to one or more news category


.. _Changelog:

Changelog
---------

- **05-10-2014 (v.3.0.1, beta) –** Minor manual modifications (Crontab
  section).

- **04-10-2014 (v.3.0.0, beta) – Now compatible with TYPO3 6.2;
  please do not install it on previous versions. Manual updated; CRON
  must be reconfigured.**

- **05-09-2012 (v.2.7.0) -** some code changes to ensure 4.7.x
  compatibility. Not yet compatible with 6.x; review of site definitions
  (Google is now removed).

- **05-12-2011 (v.2.5.0) –** new site definition upgraded: google not
  supported (the record will be hidden after the upgrade). News engines:
  yahoo.com for DE, IT, EN - BING for italian

- **04-02-2011 (v.2.4.8) –** new site definition upgraded, minor bug
  fixed. Now works with dispatcher from BE

- **29-10-2009 (v. 2.3.3) - guide updated ,** new site definition
  upgraded

- **29-08-2009 (v. 2.3.2) –** documentation updated.

- **28-08-2009 (v. 2.3.1) -** modified htmlspecialchars\_decode adding
  some code to ensure compatibility with PHP < 5.1; thanks to *Andreas
  Weigelt* for discovering this “bug”.

- **01-06-2009**  **(v. 2.2.12)** – added the use of
  htmlspecialchars\_decode for the URL retrieved; unfortunately this
  feature restricts the use to PHP v.5.1+

- **28-02-2009 (v.2.2.11) – Site definition updated, guide updated.**
  Google has just changed the format of HTML page and since today
  news.google.it and news.google.com have the same parameters.

- **04-05-2008 (v.2.2.2 and v.2.2.3) – guide updated** , new site
  definition

- **26-03-2008 (v.2.2.1) – guide updated** (some little mistakes)

- **22-03-2008 (v.2.2.1) – Cron mode:** the permissions of downloaded
  pictures are set (by default) to the owner/group of uploads/pics; you
  can override this parameter.

- **22-03-2008 (v.2.2.0) – Cron mode now downloads correctly the
  images.**

- **10-02-2007 (v.2.1.5) – Output suppressed (debug), site definition
  updated.**

- **27-12-2007 (v.2.1.1) – Bug fixes –** Function to clear cache now
  works accepting more parameters (see reference). Property:
  mod.web\_txttnewsfeederM1.clearCachePages = all. Clear modified to
  clear all cache, pages and list of ids.

- **23-12-2007 (v.2.0.7) – Bug fixes –** Library class modified (if
  there was only one site defined, news wasn't fetched). The Italian
  definition for google.it didn't work correctly because the URL was not
  defined correctly. Guide updated.

- **27-07-2007 (v.2.0.4) – New TS config,** CLI mode report messages
  added/improved

- **28-05-2007 (v.2.0.2) – Minor bug fixes,** CLI mode report messages
  added/improved

- **22-05-2007 (v.2.0.1) – Two bug fixes –** External news not removed
  (all Modes), mail not starting in CRON mode.

- **22-05-2007 (v.2.0.0) – Major release –** Now works in CRON mode, PHP
  code has been reviewed and heavily modified; a new class introduced;
  some improvements and minor bugs fixed. Guide updated for CRON and
  other.

- **06-01-2007 (v.1.2.2) –** new field for search engine/sites charset
  (please update this version and reload site/definitions)

- **06-01-2007 (v.1.2.1) –** adjusted Google.news definition; images:
  some PHP code modified to preserve colors.

- **05-01-2007 (v1.2) –** new site definitions for  *google.it news* ;
  modified some code for update, new documentation.

- **04-01-2007 (v.1.1.21) –** new site definitions for  *google.com
  news* ; modified some code for update.

- **02.01.2007 (v. 1.1.16) –** image autoresize feature implemented;
  the field  *check every n days* is no longer required because of a
  TYPO3 bug in testing this type of field.

- **30.12.2006 (v. 1.1.12 to v. 1.1.15) –** minor bug fixing

- **22.12.2006 (v. 1.1.11) –** add url parameter adjusted for
  static/dynamic sites to allow remote image fetching; manual upgraded,
  a message substituted; font inserted before subtitle in
  test/production mode.

- **17.12.2006 (v. 1.1.10) -** all (little) bugs connected to image
  management are removed.

- **12.12.2006 -** modified userBEowner parameter; access bug: not Admin
  users now can load, delete and remove records.

- **30.11.2006 -** new DB field to configure every how many days the
  check for a site/engine starts.

- **26.11.2006 -** new TSConf parameter: useTitleIfSubIsEmpty, fills the
  subtitle with the title if the subtitle is empty; new TSConf parameter:
  useRandomTime, you can enable/disable this feature – if disabled the
  records will be displayed according to the fetching order and all with
  the same hour and minute; title/subtitle check: if both are empty the
  record is refused; image status introduced with refused/accepted and
  bytes message; mandatory field for titles and url to exclude; better
  specified that you can use REGEXP; option DELETE for image uploaded on
  approval; messages for images accepted/refused in test and production
  mode

- **21.11.2006 -** delete expired news: corrected the code to show how
  many records to clean

- **20.11.2006 -** image downloading support

- **19.11.2006 -** stable version with site definition update feature
  implemented

- **14.11.2006 -** problem discovered: you must copy/paste the code
  *not* in your feeder-folder but in your root-page properties!

- **11.11.2006 -** new parameter for charset conversion; new function:
  load site definitions; error messages improved; checkboxes to select
  one or more sites; manual on-line updated

- **30.10.2006 -** Second version: manual upgrade, a new field
  introduced for the scheme. Minor changes, cache not cleared fixed.

- **25.10.2006 -** First version published



.. ######CUTTER_MARK_IMAGES######

.. |img-1| image:: img-1.png
.. :align: left
.. :border: 0
.. :height: 283
.. :id: graphics4
.. :name: graphics4
.. :width: 513

.. |img-2| image:: img-2.png
.. :align: left
.. :border: 0
.. :height: 213
.. :id: graphics1
.. :name: graphics1
.. :width: 513

.. |img-3| image:: img-3.png
.. :align: left
.. :border: 0
.. :height: 363
.. :id: graphics2
.. :name: graphics2
.. :width: 478

.. |img-4| image:: img-4.png
.. :align: left
.. :border: 0
.. :height: 193
.. :id: graphics3
.. :name: graphics3
.. :width: 477

.. |img-5| image:: img-5.png
.. :align: left
.. :border: 0
.. :height: 17
.. :id: immagini2
.. :name: immagini2
.. :width: 91

.. |img-6| image:: img-6.png
.. :align: left
.. :border: 0
.. :height: 32
.. :id: Graphic1
.. :name: Graphic1
.. :width: 102