How to index external content using the CACHE macro

Description

Many macros allow external content to be displayed in Confluence, for instance the macros from SQL for Confluence, HTML for Confluence, and Excel Plugin for Confluence. Even though the content appears on a Confluence page, it is not included in the Confluence search index because it is not part of the page's wiki markup. One way to make the content searchable in Confluence is to use the cache macro. An alternative is described in How to index external content using the RUN macro. The advantage of the cache macro technique is that the search result returned is the actual page the content is on.

Considerations

The content should be independent of the user accessing the page. Otherwise the content will change depending on who last visited the page and caused the cache to be updated.
Consider automating the updating of the content by using the Confluence Command Line Interface (CLI) to render the page on a regular basis.
Additional processing is required each time the cache is refreshed. Specifically, the page will be re-indexed more often than before.
The behavior of the cache macro does not change when the new index feature is not used.
An additional index extractor is involved in indexing operations:
- Very low overhead for pages that do not use new feature
- Can be disabled for the site from the plugin screen
Requires Cache plugin release 4.0 or above, which supports Confluence 3.1 and above. For earlier Confluence releases, an earlier beta can be used.
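
As a rough illustration of the automation suggested above, the following Python sketch builds a CLI command line that renders a page, plus a cron entry to run it on a schedule. It assumes the CLI is installed as `confluence` and supports a `renderPage` action with `--server`, `--user`, `--password`, `--space`, and `--title` options. Check these against the documentation for your CLI release. The server URL, user, space, and title shown are hypothetical.

```python
import shlex


def build_render_command(server, user, password, space, title, cli="confluence"):
    """Build an argument list asking the Confluence CLI to render one page.

    Assumption: the CLI exposes a renderPage action and the options used
    below; verify the names against your CLI version before relying on them.
    """
    return [
        cli,
        "--server", server,
        "--user", user,
        "--password", password,
        "--action", "renderPage",
        "--space", space,
        "--title", title,
    ]


def as_cron_line(schedule, command):
    """Format the command as a crontab entry, quoting arguments for the shell."""
    return schedule + " " + " ".join(shlex.quote(arg) for arg in command)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    cmd = build_render_command(
        "http://wiki.example.com", "bot", "secret", "DOC", "Weekly Report"
    )
    print(as_cron_line("0 * * * *", cmd))
```

Rendering the page from a dedicated account keeps the cached content independent of individual users, in line with the first consideration above.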

New parameter

The following additional parameter is available on the cache macro:

index - If index=true, the cached content is added to the Confluence search index, and the page is re-indexed whenever the cached data is updated. Default is "false".

Examples

SQL queries

{cache:index=true}
The results from the SQL query will be indexed for search.
{sql-query:dataSource=ReportDS}
select * from report
{sql-query}
{cache}

Web pages

Using the html macro from HTML for Confluence:

{cache:index=true}
The page pointed to by the url will be indexed for search.
{html:script=#http://www.atlassian.com/about/}
{html}
{cache}

How does it work?

Assume a page has a cache macro instance that specifies index=true.

Whenever the page is viewed, one of two things happens:
- The cache is valid (the cached data is current according to the refresh parameter and the age of the cached data): the cached data is used as normal.
- The cache is expired: the body of the cache macro is rendered and the rendered data is stored in the cache as normal. In addition, the page is flagged for re-indexing, since the new rendering has changed the dynamic content of the page.
The standard indexing queue is then processed as normal by all the registered content extractors. The cache macro adds a new content extractor to process pages with cached content.
- The cache macro extractor recognizes the page and adds the appropriate cached content to the content that will be indexed for the page.
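
The flow above can be modeled with a small Python sketch. This is a simplified illustration, not the plugin's actual implementation: the class and method names are hypothetical, the refresh period is a plain number of seconds, and index is treated as a setting of the whole cache rather than of each macro instance.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    html: str
    stored_at: float


@dataclass
class IndexingCache:
    """Toy model of a cache that flags pages for re-indexing on refresh."""
    refresh_seconds: float
    index: bool = False
    entries: dict = field(default_factory=dict)        # page id -> CacheEntry
    reindex_queue: list = field(default_factory=list)  # pages flagged for re-indexing

    def view(self, page_id, render, now=None):
        """Return the macro output, re-rendering only when the cache is expired."""
        now = time.time() if now is None else now
        entry = self.entries.get(page_id)
        if entry is not None and now - entry.stored_at < self.refresh_seconds:
            return entry.html  # cache is valid: use the cached data as normal
        html = render()        # cache is expired: render the macro body again
        self.entries[page_id] = CacheEntry(html, now)
        if self.index and page_id not in self.reindex_queue:
            self.reindex_queue.append(page_id)  # flag the page for re-indexing
        return html

    def process_queue(self):
        """Stand-in for the extra content extractor run by the indexing queue."""
        indexed = {pid: self.entries[pid].html for pid in self.reindex_queue}
        self.reindex_queue.clear()
        return indexed
```

Passing `now` explicitly makes the expiry behavior easy to exercise: a second view inside the refresh period returns the cached HTML without calling `render`, while a view after it re-renders and queues the page again.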

What data is indexed?

The cache macro renders the contents of the macro into HTML and stores the HTML in the cache. The cache content extractor then processes the HTML from the cache and extracts only the text and attribute fields, using the Jericho HTML Parser.
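
To give a feel for what "text and attribute fields" means, here is a minimal Python sketch that pulls the text nodes and attribute values out of an HTML fragment. It uses Python's standard html.parser module as a stand-in, not the Jericho HTML Parser, so the exact output of the plugin may differ. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextAndAttributeExtractor(HTMLParser):
    """Collect text nodes and attribute values, discarding the markup itself."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Keep attribute values (for example title or href); drop tag and attribute names.
        for _name, value in attrs:
            if value:
                self.parts.append(value)

    def handle_data(self, data):
        # Keep non-empty text between tags.
        text = data.strip()
        if text:
            self.parts.append(text)


def extract_indexable_text(html):
    """Return the text and attribute values of an HTML fragment as one string."""
    parser = TextAndAttributeExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    fragment = '<td title="Q1 total">42</td><a href="http://example.com/r">Report</a>'
    print(extract_indexable_text(fragment))
```

For the fragment shown, the extracted string contains the title value, the cell text, the link target, and the link text, which is the kind of content a search for "Q1 total" or "Report" would then match.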