Indexing External Content

How to use the cache macro to index external content

Indexing cached content lets users find information from external sources, including databases and web pages, with Confluence search.

Beta

This feature is currently available only by downloading and installing the following beta version of the cache plugin. It will be available in the plugin repository when the beta is complete.

Beta jar - 3.2.0-b1.jar

New parameter

The following additional parameter is available on the cache macro:

index - If index=true, the cached content is added to the Confluence search index. The default is false. Whenever the cached data is updated, the page is re-indexed.

Examples

SQL queries

{cache:index=true}
The results from the SQL query will be indexed for search.
{sql-query:dataSource=ReportDS}
select * from report
{sql-query}
{cache}

Web pages

{cache:index=true}
The page pointed to by the url will be indexed for search.
{html:script=#http://www.atlassian.com/about/}
{html}
{cache}

How does it work?

Assume a page contains a cache macro instance that specifies index=true.

Whenever the page is viewed, one of two things happens:
- The cache is valid (current according to the refresh parameter and the age of the cached data): the cached data is used as normal.
- The cache has expired: the data within the cache macro is rendered and the rendered data is stored in the cache as normal. In addition, the page is flagged for re-indexing, because the new rendering has changed the dynamic content of the page.

The standard indexing queue is then processed as normal by all the registered content extractors. The cache macro adds a new content extractor to process pages with cached content:
- The cache macro extractor recognizes the page and adds the appropriate cached content to the content that will be indexed for the page.
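
The view-time flow described above can be sketched roughly as follows. This is an illustrative Python model, not the plugin's actual code: the names CacheMacroSketch, CacheEntry, view, and reindex_queue are hypothetical, and the refresh period is assumed to be expressed in seconds.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set


@dataclass
class CacheEntry:
    html: str         # rendered macro body
    stored_at: float  # time the entry was stored, in seconds


@dataclass
class CacheMacroSketch:
    """Hypothetical model of one cache macro; not the plugin's real classes."""
    refresh_seconds: float
    index: bool = False
    entries: Dict[str, CacheEntry] = field(default_factory=dict)
    reindex_queue: Set[str] = field(default_factory=set)

    def view(self, page_id: str, render: Callable[[], str],
             now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        entry = self.entries.get(page_id)
        # Cache is valid: use the cached data as normal.
        if entry is not None and now - entry.stored_at < self.refresh_seconds:
            return entry.html
        # Cache is missing or expired: render and store as normal.
        html = render()
        self.entries[page_id] = CacheEntry(html, now)
        # With index=true, also flag the page as needing to be re-indexed.
        if self.index:
            self.reindex_queue.add(page_id)
        return html
```

In this sketch a page is queued for re-indexing only when its cached data is actually refreshed, which matches the behavior described above: viewing a page whose cache is still valid causes no extra indexing work.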

What data is indexed?

The cache macro renders the contents of the macro into HTML and stores the HTML in the cache. The cache content extractor then processes the HTML data from the cache and extracts only the text and attribute fields using the Jericho HTML Parser.
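
To give a feel for what "text and attribute fields" means, here is a minimal sketch of that kind of extraction. It uses Python's standard html.parser module in place of the Jericho HTML Parser (a Java library), so it only approximates the plugin's behavior. The names TextAndAttributeExtractor and extract_indexable_text are hypothetical, and treating every non-empty attribute value as indexable is an assumption.

```python
from html.parser import HTMLParser


class TextAndAttributeExtractor(HTMLParser):
    """Collects text nodes and attribute values; the markup itself is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: every non-empty attribute value is kept for indexing.
        for _name, value in attrs:
            if value:
                self.parts.append(value)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def extract_indexable_text(html: str) -> str:
    parser = TextAndAttributeExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

For example, extract_indexable_text('<p>Quarterly <a href="http://example.com/r" title="Report">sales</a></p>') returns "Quarterly http://example.com/r Report sales": the tags are gone, while the link text and attribute values remain searchable.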

Considerations

Additional processing is required each time the cache is refreshed for a page. Specifically, the page will be indexed more often than before
No change in behavior when the new index feature is not used
An additional index extractor is involved in indexing operations
- Very low overhead for pages that do not use new feature
- Can be disabled for the site from the plugin screen