.. ==================================================
.. FOR YOUR INFORMATION
.. --------------------------------------------------
.. -*- coding: utf-8 -*- with BOM.

.. include:: ../Includes.txt


.. _analysing-indexed-data:

Analysing the indexed data
--------------------------

The indexer is constructed to work with TYPO3's page structure. Unlike
a crawler, which simply indexes all the pages it can find, the TYPO3
indexer MUST take the following into account:

- Only cached pages can be indexed. Pages with dynamic content (such as
  search pages) should supply their own search engine for lookup in
  specific tables. Another option is to selectively allow certain of
  those "dynamic" pages to be cached anyway (see the cHashParams
  concept used by some plugins).

- Pages in more than one language must be indexed separately as
  "different pages".

- Pages with message boards may have multiple indexed versions based on
  what is displayed on the page: the overview or a single message board
  item. This is determined by the cHashParams value.

- Access restrictions on pages must be observed!

- Because pages can contain different content depending on whether a
  user is logged in, and even on which groups the user is a member of,
  a single page (identified by the combination of
  id/type/language/cHashParams) may be available in more than one
  indexed version based on the user groups. But while the same page may
  have different content based on the user groups (and so must be
  indexed once for each), such pages may just as well present the SAME
  content regardless of user groups! This is the trickiest part.


.. _complex-scenarios:

Understanding these complex scenarios...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The best thing to do is to look at an example. Please refer to the
picture below while reading this list:

#. The overview in general shows **one line per "phash-row"** (a single
   row from the index\_phash table).
   Such a row represents a single hit in a search session. In other
   words, each line with a grayish background in the overview may be a
   search hit. The columns of these rows are:

   - Title: The search-result title.

   - [icon]: Click here to remove the indexed information for this
     entry (it will be re-indexed on the next hit).

   - pHash: The "id" of the search row. The hash is calculated from the
     id/type/language/MP/cHashParams/gr\_list of the page when indexed.
     For external media it is based on the filepath and, for PDFs only,
     the page interval.

   - cHash: Calculated from the actual content which was indexed.

   - rl-012: The rootline ids for levels 0, 1 and 2. Used when
     searching in certain sections. For instance, a search operation
     may select all pages with rl1=123, which results in a search
     within pages that exist ONLY in the branch of the website where
     the level-1 page has uid=123.

   - pid.t.l: The page id / type number / sys\_language uid.

   - Size: How many bytes the indexed page consumed.

   - grlist: The gr\_list of the user who initiated the indexing
     operation.

   - cHashParams: Additional parameters which identify the page, in
     addition to the id/type number which usually does that.

#. The page "Content elements" has one indexed version. The page id of
   the root page is "1" and the page on level 1 in the rootline has the
   uid "2". Notice how all subpages of "Content elements" have exactly
   the same rl0 and rl1 values. Whereas the page "Content elements"
   does NOT have a value for rl2, all its subpages do (because they ARE
   level 2 themselves!). Furthermore, the page has the page id "2", a
   type value of "0" and is indexed with the default language "0". The
   size was 10.6 KB and the user who initiated the indexing operation
   was a member of the groups 0,-2,1 (which is effectively fe\_group
   "1", because 0 and -2 are pseudo-groups).

#. On the page "Special content" there must have been a link to a local
   PDF file and a Word file, since those two are indexed in relation to
   this page. The PDF file is located at the path
   "uploads/media/tsref\_onepage.pdf" relative to the website. Notice
   that the PDF file is actually indexed three times, once per page of
   the PDF. This is of course configurable. Each indexed section of the
   PDF file can show up as a search-result row (because the phash is
   different for each indexed part). The point of this is that a large
   PDF file might contain so much information that it would match far
   too many search queries. Breaking a PDF file down into smaller parts
   makes it possible to indicate exactly WHERE in the PDF file the
   search word was found!

#. Looking at the Word file (and the PDF file as well) we see that they
   are found on BOTH the page "Special content" and the page "ISEARCH
   example". But the phash value (for the Word file it is "268192666")
   is the SAME in both cases. This means that the Word and PDF files
   are indexed only once, when they are first discovered! Later, when
   another page is indexed and a link to the same document appears, the
   document is not indexed as another document. Instead, an entry is
   made in the index\_section table indicating that this result row is
   also available (linked to) from another page/section.

   Say you are doing a search in the section from "Content elements"
   and outwards in the page tree. The Word document is matched in the
   search, but it will appear only once in the search result. Now, if
   one of the two pages linking to the Word document were hidden or
   access restricted, the Word document would still be matched (because
   the other page is accessible to the user). But if BOTH pages with
   the link to the Word document are inaccessible to the user doing the
   search, then the Word document will not be included in the search
   result.

#. Here we can see that the pages "Special content", "Advanced" and
   "Menu/Sitemap" are each indexed twice. The reason is that those
   three pages had different content depending on whether or not a user
   was logged in!

   In the case of the page "Special content", the reason is that the
   page contained a content element which was visible only to users who
   were members of group 1. Therefore the page was different in the two
   cases.

   The page "Advanced" has a user-login form, and that form looks
   different depending on whether a user is logged in.

   Finally, the page "Menu/Sitemap" apparently changed. The reason is
   that this page includes a sitemap, and that sitemap displayed some
   extra pages when logged-in users hit the page, so the content was
   not the same as without login.

   Another interesting thing is that two different users must have
   visited those pages. We can see that because the page "Special
   content" was apparently indexed with the usergroup combination
   "1,2". Later another user, a member of group "1" only, hit the page.
   However, the page content was the SAME. Because those two users saw
   the very same page, it was not indexed a third time. Instead, it was
   noted that a user with membership of only group "1" also saw this
   same page. That comparison was based on the cHash (contentHash),
   which is a hash value based on the actual content being indexed. So
   when the user with group "1" only came to the page, the indexer
   engine realized that the page as it looked had already been indexed,
   because another phash-row with that content hash was already
   available.

#. These pages do not contain any tricks, it appears. According to the
   grlists, users with membership of groups "1,2" and of group "1"
   only, *as well* as surfers who did not log in at all ("0,-1" is the
   pseudo-group for no login), have visited the page.
   And because only one indexed version exists, the page must have
   presented the same content to all users regardless of their login
   status.

   The reason why the page "Your own scripts" does not have a grlist
   value "0,-2,1,2" as the others do is simply that no user with that
   combination of usergroups has ever visited the page!

#. txt and html documents can also be indexed as external media. In the
   case of HTML-documents the documents