Apache Solr is an enterprise search platform based on Apache Lucene that has matured steadily over the years. It is a standalone enterprise search server with a REST-like API. You put documents into it (called "indexing") via XML, JSON or binary over HTTP. You query it via HTTP GET and receive XML, JSON, or binary results. For a more detailed description of what Solr is and how it works, please visit the Apache Solr project website. Searching through Solr's powerful and flexible REST-like interface reduces development complexity. Moreover, you can rely on existing graphical interfaces that provide comfortable AJAX-based search functionality to the end users of your internet/intranet application.
Since version 8.5, OpenCms integrates Apache Solr, not only for full text search but also as a powerful enterprise search platform.
The documentation itself features a Solr-based search facility. Look for the magnifier icon in the top navigation.
Imagine you want to show a list of "all articles that have changed since yesterday and where the property 'X' has the value 'Y'":
http://localhost:8080/opencms/opencms/handleSolrSelect?
fq=type:v8article
&fq=lastmodified:[NOW-1DAY TO NOW]
&fq=Title_prop:Flower
Parameter explanation:
http://localhost:8080/opencms/opencms/handleSolrSelect
// The URI of the OpenCms Solr Select Handler configured in
// 'opencms-system.xml'
?fq=type:v8article // Filter query on the field type
// with the value 'v8article'
&fq=lastmodified:[NOW-1DAY TO NOW] // Filter query on the field lastmodified
// with a range query from 'NOW-1DAY TO NOW'
&fq=Title_prop:Flower // Filter query on the field Title_prop
// with the value 'Flower'
If you want to get familiar with the Solr query syntax, the page "Solr query syntax" gives a general overview. For advanced features, see "Searching" in the Solr Reference Guide by Lucid Imagination.
Please note that many characters in the Solr query syntax (most notably the plus sign "+") are special characters in URLs, so when constructing request URLs manually, you must properly URL-encode these characters. The following query is shown first in raw and then in URL-encoded form:
q= +popularity:[10 TO *] +section:0
http://localhost:8983/solr/select?q=%2Bpopularity:[10%20TO%20*]%20%2Bsection:0
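As a minimal sketch of the encoding step, the following Java helper builds a request URL from raw Solr parameters using only java.net.URLEncoder. The class and method names (SolrUrlSketch, encodeParam, buildSelectUrl) are hypothetical and not part of OpenCms or Solr. Note that URLEncoder applies form encoding, so spaces become "+" instead of "%20" and characters such as ":" and "[" are escaped as well; the sketch assumes the server decodes the query string as form data, which servlet containers normally do.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of OpenCms or Solr. */
final class SolrUrlSketch {

    /** URL-encodes a single parameter value as UTF-8 (form encoding). */
    static String encodeParam(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    /** Appends q and every filter query as encoded parameters to the handler URI. */
    static String buildSelectUrl(String handlerUri, String q, List<String> filterQueries) {
        StringBuilder url = new StringBuilder(handlerUri);
        url.append("?q=").append(encodeParam(q));
        for (String fq : filterQueries) {
            url.append("&fq=").append(encodeParam(fq));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // The raw query from the example above.
        System.out.println(encodeParam("+popularity:[10 TO *] +section:0"));
        // The filter queries from the first example, sent to the OpenCms handler.
        System.out.println(buildSelectUrl(
            "http://localhost:8080/opencms/opencms/handleSolrSelect",
            "*:*",
            Arrays.asList("type:v8article", "lastmodified:[NOW-1DAY TO NOW]")));
    }
}
```

Encoding each value separately keeps the "&" and "=" separators of the query string intact, which would not be the case if the whole URL were encoded at once.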
For more information, see Yonik Seeley's blog on Nested Queries in Solr.
You can pass any valid Solr input to the new OpenCms Solr request handler (handleSolrSelect). The Solr wiki page "Search and Indexing" is a good place to get familiar with the Solr query syntax.
By default, Solr can return its response as XML or JSON. With the additional parameter 'wt' you can specify the QueryResponseWriter that Solr should use. For the query example shown above, a result can look like this:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">7</int>
<lst name="params">
<str name="qt">dismax</str>
<str name="fl">*,score</str>
<int name="rows">50</int>
<str name="q">*:*</str>
<arr name="fq">
<str>type:v8article</str>
<str>contentdate:[NOW-1DAY TO NOW]</str>
<str>Title_prop:Flower</str>
</arr>
<long name="start">0</long>
</lst>
</lst>
<result name="response" numFound="2" start="0">
<doc>
<str name="id">51041618-77f5-11e0-be13-000c2972a6a4</str>
<str name="contentblob">[B:[B@6c1cb5</str>
<str name="path">/sites/default/.content/article/a_00003.html</str>
<str name="type">v8article</str>
<str name="suffix">.html</str>
<date name="created">2011-05-06T15:27:13Z</date>
<date name="lastmodified">2011-08-17T13:58:29Z</date>
<date name="contentdate">2012-09-03T10:41:13.56Z</date>
<date name="released">1970-01-01T00:00:00Z</date>
<date name="expired">292278994-08-17T07:12:55.807Z</date>
<arr name="res_locales">
<str>en</str>
<str>de</str>
</arr>
<arr name="con_locales">
<str>en</str>
</arr>
<str name="template_prop">
/system/modules/com.alkacon.opencms.v8.template3/templates/main.jsp</str>
<str name="style.layout_prop">/.content/style</str>
<str name="NavText_prop">OpenCms 8 Demo</str>
<str name="Title_prop">Flower Today</str>
<arr name="content_en">
<str>News from the world of flowers Flower Today In this [...]</str>
</arr>
<date name="timestamp">2012-09-03T10:45:47.055Z</date>
<float name="score">1.0</float>
</doc>
<doc>
<str name="id">ac56418f-77fd-11e0-be13-000c2972a6a4</str>
<str name="contentblob">[B:[B@1d0e4a2</str>
<str name="path">/sites/default/.content/article/a_00030.html</str>
<str name="type">v8article</str>
<str name="suffix">.html</str>
<date name="created">2011-05-06T16:27:02Z</date>
<date name="lastmodified">2011-08-17T14:03:27Z</date>
<date name="contentdate">2012-09-03T10:41:18.155Z</date>
<date name="released">1970-01-01T00:00:00Z</date>
<date name="expired">292278994-08-17T07:12:55.807Z</date>
<arr name="res_locales">
<str>en</str>
<str>de</str>
</arr>
<arr name="con_locales">
<str>en</str>
</arr>
<str name="template_prop">
/system/modules/com.alkacon.opencms.v8.template3/templates/main.jsp
</str>
<str name="style.layout_prop">/.content/style</str>
<str name="NavText_prop">OpenCms 8 Demo</str>
<str name="Title_prop">Flower Dictionary</str>
<arr name="content_en">
<str>The different types of flowers Flower Dictionary There are
[...]</str>
</arr>
<date name="timestamp">2012-09-03T10:45:49.265Z</date>
<float name="score">1.0</float>
</doc>
</result>
</response>
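A response like the one above can be read with any XML parser. The following is a minimal sketch using the DOM parser shipped with the JDK; the class and method names (SolrXmlResponseSketch, parse, numFound, stringField) are hypothetical, and the embedded XML is a shortened stand-in for a real response.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper that reads a Solr XML response with the JDK DOM parser. */
final class SolrXmlResponseSketch {

    /** Parses the XML text; parser errors are rethrown unchecked. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed XML response", e);
        }
    }

    /** Returns the numFound attribute of the first <result> element. */
    static long numFound(Document response) {
        Element result = (Element) response.getElementsByTagName("result").item(0);
        return Long.parseLong(result.getAttribute("numFound"));
    }

    /** Collects the value of the <str name="fieldName"> field from every <doc>. */
    static List<String> stringField(Document response, String fieldName) {
        List<String> values = new ArrayList<String>();
        NodeList docs = response.getElementsByTagName("doc");
        for (int i = 0; i < docs.getLength(); i++) {
            NodeList strs = ((Element) docs.item(i)).getElementsByTagName("str");
            for (int j = 0; j < strs.getLength(); j++) {
                Element str = (Element) strs.item(j);
                // <str> entries inside <arr> carry no name attribute and are skipped.
                if (fieldName.equals(str.getAttribute("name"))) {
                    values.add(str.getTextContent().trim());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<response><result name=\"response\" numFound=\"2\" start=\"0\">"
            + "<doc><str name=\"path\">/sites/default/.content/article/a_00003.html</str></doc>"
            + "<doc><str name=\"path\">/sites/default/.content/article/a_00030.html</str></doc>"
            + "</result></response>";
        Document response = parse(xml);
        System.out.println(numFound(response) + " hits: " + stringField(response, "path"));
    }
}
```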
// Pass the same parameters as in the HTTP example as one query string.
String query = "fq=type:v8article&fq=lastmodified:[NOW-1DAY TO NOW]&fq=Title_prop:Flower";
CmsSolrResultList results = OpenCms.getSearchManager().getIndexSolr("Solr Online Index").search(getCmsObject(), query);
// Read indexed field values from each result.
for (CmsSearchResource sResource : results) {
    String path = sResource.getField(I_CmsSearchField.FIELD_PATH);
    Date date = sResource.getDateField(I_CmsSearchField.FIELD_DATE_LASTMODIFIED);
    List<String> cats = sResource.getMultivaluedField(I_CmsSearchField.FIELD_CATEGORY);
}
The class org.opencms.search.solr.CmsSolrResultList encapsulates a list of 'OpenCms resource documents' (CmsSearchResource). The list can be accessed exactly like an ArrayList with entries of the type CmsSearchResource, which extends the type CmsResource and holds the Solr implementation of I_CmsSearchDocument as a member. This format enables you to deal with the results as with a well-known List and to work on its entries as you do on CmsResource.
The CmsSolrQuery class for querying Solr
CmsSolrIndex index = OpenCms.getSearchManager().getIndexSolr("Solr Online Index");
Map parameters = new HashMap<String,String>();
parameters.put("path","/sites/default/xmlcontent/article_0001.html");
CmsSolrQuery squery = new CmsSolrQuery(getCmsObject(), parameters);
CmsSolrResultList results = index.search(getCmsObject(), squery);
Solr comes with a whole bunch of features, which are documented in the Solr wiki.
In the Solr world, "core" is the term used when talking about several indexes. To use the correct term, let's say core instead of index. Multiple cores should only be required if you have completely different applications but want a single Solr server that manages all the data. See Solr Core Administration for detailed information. Assuming you have configured multiple Solr cores and would like to query a specific one, you have to tell Solr/OpenCms which core/index you want to search. This is done with a special parameter:
http://localhost:8080/opencms/opencms/handleSolrSelect?
// The URI of the OpenCms Solr Select Handler
// configured in 'opencms-system.xml'
&core=My Solr Index Name // Searches on the core with the name 'My Solr Index Name'
&q=content_en:Flower // for the text 'Flower'
OpenCms (since version 8.5) delivers a standard Solr collector. Use the name byQuery to simply pass a query string, or the name byContext to pass a query string and let OpenCms use the user's request context. The implementing class for this collector is org.opencms.file.collectors.CmsSolrCollector.
<cms:contentload collector="byQuery" preload="true"
param='fq=parent-folders:"/sites/default/"&fq=type:ddarticle&sort=lastmodified desc'>
<cms:contentinfo var="info" />
<c:if test='${info.resultSize != 0}'>
<h3>Solr Collector Demo</h3>
<cms:contentload editable="false">
<cms:contentaccess var="content" />
<%-- Title of the article --%>
<h6>${content.value.Title}</h6>
<%-- The text field of the article with image --%>
<div class="paragraph">
<%-- Set the required variables for the image. --%>
<c:if test="${content.value.Image.isSet}">
<%-- Output of the image using cms:img tag --%>
<c:set var="imgwidth">${(cms.container.width - 20) / 3}</c:set>
<%-- Output the image. --%>
<cms:img src="${content.value.Image}" />
</c:if>
${cms:trimToSize(cms:stripHtml(content.value.Text), 300)}
</div>
<div class="clear"></div>
</cms:contentload>
</c:if>
</cms:contentload>
In general, the system-wide search configuration for OpenCms is done in the file opencms-search.xml (<CATALINA_HOME>/webapps/<OPENCMS_WEBAPP>/WEB-INF/config/opencms-search.xml).
Since version 8.5 of OpenCms, a new optional node with the XPath opencms/search/solr is available. To simply enable the OpenCms embedded Solr server, your opencms-search.xml should start like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE opencms SYSTEM "http://www.opencms.org/dtd/6.0/opencms-search.dtd">
<opencms>
<search>
<solr enabled="true"/>
[...]
</search>
</opencms>
Optionally, you can configure the Solr home directory and the name of the main Solr configuration file (default: solr.xml). OpenCms then concatenates those two paths to <solr_home>/<configfile>. An example of such a configuration looks like this:
<solr enabled="true">
<home>/my/solr/home/folder</home>
<configfile>rabbit.xml</configfile>
</solr>
In order to disable Solr system-wide, remove the <solr/> node or set the enabled attribute to false, like this:
<solr enabled="false"/>
It is also possible to connect to an external HTTP Solr server. To do so, replace the line <solr enabled="true"/> with the following:
<solr enabled="true" serverUrl="http://mySolrServer" />
The OpenCms SolrSelect request handler does not support the external HTTP Solr server. So if your HTTP Solr server is directly reachable at http://<your_server>, no permission check is performed and indexed data that is secret will be accessible. This means that you yourself are responsible for protecting resources that have permission restrictions set in the VFS of OpenCms. You can, however, use the method org.opencms.search.solr.CmsSolrIndex.search(CmsObject, SolrQuery) or org.opencms.search.solr.CmsSolrIndex.search(CmsObject, String) to be sure that permissions are also checked for HTTP Solr servers. A future version of OpenCms may feature secure access to HTTP Solr servers.
By default OpenCms comes with a "Solr Online" index. To add a new Solr index, you can use the default configuration as a template:
<index class="org.opencms.search.solr.CmsSolrIndex">
<name>Solr Online</name>
<rebuild>auto</rebuild>
<project>Online</project>
<locale>all</locale>
<configuration>solr_fields</configuration>
<sources>
<source>solr_source</source>
</sources>
</index>
Index sources for Solr can be configured in the file opencms-search.xml in exactly the same way as for Lucene indexes. In order to use the advanced XSD field mapping for XML contents, you must add the new document type xmlcontent-solr to the list of document types that are indexed:
<indexsource>
<name>solr_source</name>
<indexer class="org.opencms.search.CmsVfsIndexer" />
<resources>
<resource>/sites/default/</resource>
</resources>
<documenttypes-indexed>
<name>xmlcontent-solr</name>
<name>containerpage</name>
<name>xmlpage</name>
<name>text</name>
<name>pdf</name>
<name>image</name>
<name>msoffice-ole2</name>
<name>msoffice-ooxml</name>
<name>openoffice</name>
</documenttypes-indexed>
</indexsource>
xmlcontent-solr
OpenCms version 8.5 introduces a new document type called xmlcontent-solr. Its implementation (CmsSolrDocumentXmlContent) performs a localized content extraction that is used later on to fill the Solr input document. As explained in the section about custom fields for XML content, it is possible to define a mapping between elements defined in the XSD of an XML resource type and a field of the Solr document. The values for those XSD field mappings are also extracted by the document type xmlcontent-solr.
<documenttype>
<name>xmlcontent-solr</name>
<class>org.opencms.search.solr.CmsSolrDocumentXmlContent</class>
<mimetypes>
<mimetype>text/html</mimetype>
</mimetypes>
<resourcetypes>
<resourcetype>xmlcontent-solr</resourcetype>
</resourcetypes>
</documenttype>
By default, the field configuration for OpenCms Solr indexes is implemented by the class org.opencms.search.solr.CmsSolrFieldConfiguration. The simplest Solr field configuration declared in opencms-search.xml looks as follows. See also the section about extending the CmsSolrFieldConfiguration.
<fieldconfiguration class="org.opencms.search.solr.CmsSolrFieldConfiguration">
<name>solr_fields</name>
<description>The Solr search index field configuration.</description>
<fields />
</fieldconfiguration>
An existing Lucene field configuration can easily be transformed into a Solr field configuration. To do so, create a new Solr field configuration. As a template, you can use the snippet shown in the section about the Solr default field configuration. Just copy the list of fields from the Lucene index you want to convert into that skeleton.
OpenCms uses a specific strategy to map the Lucene field names to Solr field names:
Exact name: OpenCms looks for an explicit Solr field whose name exactly matches the Lucene field name, e.g. a field named meta for <field name="meta"> ... </field>. To make use of this strategy you have to edit the schema.xml of Solr manually and add an explicit field definition named according to the exact Lucene field name.
Type-specific fields: The schema.xml defines different data types for fields. If you want to make use of these type-specific advantages (like language-specific field analyzing/tokenizing) without manipulating the schema.xml of Solr, you have to define at least a type attribute for those fields. The value of the attribute type can be the suffix of any <dynamicField> configured in the schema.xml whose name starts with *_. The resulting field inside the Solr document is then named <luceneFieldName>_<dynamicFieldSuffix>.
Fallback: If no type attribute is defined and there is no explicit field in the schema.xml named according to the Lucene field name, OpenCms uses text_general as fallback. E.g. a Lucene field <field name="title" index="true"> ... </field> will be stored as a dynamic field named title_txt in the Solr index.
An original Lucene field configuration like the following:
<fieldconfiguration>
<name>standard</name>
<description>The standard OpenCms 8.0 search field configuration.</description>
<fields>
<field name="content" store="compress" index="true" excerpt="true">
<mapping type="content"/>
</field>
<field name="title-key" display="-" store="true" index="untokenized" boost="0.0">
<mapping type="property">Title</mapping>
</field>
<field name="title" display="%(key.field.title)" store="false" index="true">
<mapping type="property">Title</mapping>
</field>
<field name="keywords" display="%(key.field.keywords)" store="true" index="true">
<mapping type="property">Keywords</mapping>
</field>
<field name="description" store="true" index="true">
<mapping type="property">Description</mapping>
</field>
<field name="meta" display="%(key.field.meta)" store="false" index="true">
<mapping type="property">Title</mapping>
<mapping type="property">Keywords</mapping>
<mapping type="property">Description</mapping>
</field>
</fields>
</fieldconfiguration>
could look like this after conversion:
<fieldconfiguration class="org.opencms.search.solr.CmsSolrFieldConfiguration">
<name>standard</name>
<description>The standard OpenCms 8.0 Solr search field configuration.</description>
<fields>
<field name="content" store="compress" index="true" excerpt="true">
<mapping type="content"/>
</field>
<field name="title-key" store="true" index="untokenized" boost="0.0" type="s">
<mapping type="property">Title</mapping>
</field>
<field name="title" store="false" index="true" type="prop">
<mapping type="property">Title</mapping>
</field>
<field name="keywords" store="true" index="true" type="prop">
<mapping type="property">Keywords</mapping>
</field>
<field name="description" store="true" index="true" type="prop">
<mapping type="property">Description</mapping>
</field>
<field name="meta" store="false" index="true" type="en">
<mapping type="property">Title</mapping>
<mapping type="property">Keywords</mapping>
<mapping type="property">Description</mapping>
</field>
</fields>
</fieldconfiguration>
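The naming rules for converted fields can be illustrated with a small Java sketch. It is not the OpenCms implementation: the class SolrFieldNameSketch and its method resolve are hypothetical, the order in which the rules are checked is an assumption, and the fallback suffix "txt" is taken from the title_txt example above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of the naming strategy; not the OpenCms implementation. */
final class SolrFieldNameSketch {

    /** Suffix assumed for the text_general fallback (see the title_txt example). */
    static final String FALLBACK_SUFFIX = "txt";

    /**
     * Resolves the Solr field name for a Lucene field.
     *
     * @param luceneName the value of the name attribute of the field
     * @param type the value of the type attribute, or null if none is set
     * @param explicitSchemaFields names of the explicit fields declared in schema.xml
     */
    static String resolve(String luceneName, String type, Set<String> explicitSchemaFields) {
        if (explicitSchemaFields.contains(luceneName)) {
            // Exact name: an explicit field with the same name exists.
            return luceneName;
        }
        if (type != null && !type.isEmpty()) {
            // Type-specific: <luceneFieldName>_<dynamicFieldSuffix>.
            return luceneName + "_" + type;
        }
        // Fallback: general text field.
        return luceneName + "_" + FALLBACK_SUFFIX;
    }

    public static void main(String[] args) {
        Set<String> explicit = new HashSet<String>(Arrays.asList("meta"));
        System.out.println(resolve("meta", "en", explicit));
        System.out.println(resolve("title-key", "s", explicit));
        System.out.println(resolve("title", null, explicit));
    }
}
```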
The following sections will show what data is indexed by default and what possibilities are offered by OpenCms to configure / implement additional field configurations / mappings.
The Solr schema (schema.xml)
Have a look at the Solr schema.xml first. In the file <CATALINA_HOME>/webapps/<OPENCMS>/WEB-INF/solr/conf/schema.xml you will find the field definitions that will be used by OpenCms that were briefly summarized before.
By default, OpenCms indexes the following fields for each resource:
id
Structure id used as unique identifier for a document (the structure id of the resource).
path
Full root path (the root path of the resource, e.g. /sites/default/flower_en/.content/article.html).
path_hierarchy
The full path as a path-tokenized field (field type: text_path).
parent-folders
Parent folders (multi-valued field containing an entry for each parent path as root path).
type
Type name (the resource type name).
res_locales
Existing locale nodes for XML content and all available locales in case of binary files.
created
The creation date (the date when the resource itself was created).
lastmodified
The date last modified (the last modification date of the resource itself).
contentdate
The content date (the date when the resource's content was modified).
released, expired
The release and expiration date of the resource.
content
A general content field that holds all extracted resource data (all languages, type text_general).
contentblob
The serialized extraction result (content_blob) to improve the extraction performance while indexing.
category
All categories as general text.
category_exact
All categories as exact string for faceting reasons.
text_<locale>
Extracted textual content optimized for the language-specific search (default languages: en, de, el, es, fr, hu, it).
timestamp
The time when the document was last indexed.
*_prop
All properties of a resource as searchable and stored text (field name: <Property_Definition_Name>_prop, type: text_general).
*_exact
All properties of a resource as exact, not stored strings (field name: <Property_Definition_Name>_exact, type: string).
You can declare search field mappings for XML content elements directly in the XSD content definition by using the element <searchsettings>. An XSD using this feature can then look like this:
<searchsettings>
<searchsetting element="Title" searchcontent="true">
<solrfield targetfield="atitle">
<mapping type="property">Author</mapping>
</solrfield>
</searchsetting>
<searchsetting element="Teaser">
<solrfield targetfield="ateaser">
<mapping type="item" default="Homepage n.a.">Homepage</mapping>
<mapping type="content"/>
<mapping type="property-search">search.special</mapping>
<mapping type="attribute">dateReleased</mapping>
<mapping type="dynamic"
class="org.opencms.search.solr.CmsDynamicDummyField">special
</mapping>
</solrfield>
</searchsetting>
<searchsetting element="Text" searchcontent="true">
<solrfield targetfield="ahtml" boost="2.0"/>
</searchsetting>
<searchsetting element="Release" searchcontent="false">
<solrfield targetfield="arelease" sourcefield="*_dt" />
</searchsetting>
<searchsetting element="Author" searchcontent="true">
<solrfield targetfield="aauthor" locale="de"
copyfields="test_text_de,test_text_en" />
</searchsetting>
<searchsetting element="Homepage" searchcontent="true">
<solrfield targetfield="ahomepage" default="Homepage n.a." />
</searchsetting>
</searchsettings>
The element <searchsetting> is used to declare the source (XSD content element) and the corresponding destination (Solr fields) of the mapping. The attribute element specifies the XSD content element that should be mapped to a Solr field.
To specify the Solr destination field, the child element <solrfield> is used. The following list contains its possible attributes.
The <solrfield> element
targetfield (required) |
The attribute |
locale (optional) |
As previously explained, the content is written for every locale that defines content in the particular XML document. This parameter can be used to change this default behavior by only writing the locale passed by this parameter and ignoring possible existing content in other locales. |
sourcefield (optional) |
If this attribute is used, the resulting Solr field name will be |
copyfields (optional) |
The attribute |
default (optional) |
This attribute sets a default value for the field that is used in the case the appropriate XML content field is empty. |
boost (optional) |
Sets a boost to the resulting Solr field. See also in the Solr wiki. |
The element <solrfield> has an optional child element named <mapping>, which can be used to map resource properties, content items and other values to the target Solr field. These mapped values are appended to the Solr field, so the <mapping> element may occur more than once within <solrfield>.
The attribute default is used to specify a default value in case the mapping is not able to extract the desired information.
The type attribute specifies which extraction method is used. Accepted values for the attribute type are:
The type attribute of the <mapping> element
item |
Maps a structured content item to a Solr field. |
attribute |
Maps a resource attribute to a Solrfield. Possible arguments are |
content |
Map the XML content to the target field. This value expects no argument. |
property |
Map the value of a resource property to the Solr target field. |
property-search |
Search the parents of the resource for the value of the passed resource property and map this value to the Solr target field. |
dynamic |
Use an instance of the interface |
Declarative field configuration with field mappings can also be done via the XSD content definition of an XML resource type, as defined in the DefaultAppinfoTypes.xsd:
<xsd:complexType name="OpenCmsDefaultAppinfoSearchsetting">
<xsd:sequence>
<xsd:element name="solrfield"
type="OpenCmsDefaultAppinfoSolrField"
minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>
<xsd:attribute name="element" type="xsd:string" use="required" />
<xsd:attribute name="searchcontent"
type="xsd:boolean" use="optional" default="true" />
</xsd:complexType>
<xsd:complexType name="OpenCmsDefaultAppinfoSolrField">
<xsd:sequence>
<xsd:element name="mapping"
type="OpenCmsDefaultAppinfoSolrFieldMapping"
minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>
<xsd:attribute name="targetfield" type="xsd:string" use="required" />
<xsd:attribute name="sourcefield" type="xsd:string" use="optional" />
<xsd:attribute name="copyfields" type="xsd:string" use="optional" />
<xsd:attribute name="locale" type="xsd:string" use="optional" />
<xsd:attribute name="default" type="xsd:string" use="optional" />
<xsd:attribute name="boost" type="xsd:string" use="optional" />
</xsd:complexType>
Declarative field configurations with field mappings can be defined in the file opencms-search.xml. You can use exactly the same features as already known for OpenCms Lucene field configurations.
Please see the section about migrating a Lucene index to a Solr index.
CmsSolrFieldConfiguration
If the standard configuration options are still not flexible enough, you can extend the class org.opencms.search.solr.CmsSolrFieldConfiguration and define a custom Solr field configuration in opencms-search.xml:
<fieldconfiguration class="your.package.YourSolrFieldConfiguration">
<name>solr_fields</name>
<description>The Solr search index field configuration.</description>
<fields/>
</fieldconfiguration>
The class org.opencms.main.OpenCmsSolrHandler offers the same functionality as the default select request handler of a standard Solr server installation. In the OpenCms default system configuration (opencms-system.xml) the Solr request handler is configured as follows:
<requesthandlers>
<requesthandler class="org.opencms.main.OpenCmsSolrHandler" />
</requesthandlers>
Alternatively, the request handler class can be used as a servlet. To do so, add the handler class to the WEB-INF/web.xml of your OpenCms application:
<servlet>
<description>
The OpenCms Solr servlet.
</description>
<servlet-name>OpenCmsSolrServlet</servlet-name>
<servlet-class>org.opencms.main.OpenCmsSolrHandler</servlet-class>
<load-on-startup>1</load-on-startup>
</servlet>
[...]
<servlet-mapping>
<servlet-name>OpenCmsSolrServlet</servlet-name>
<url-pattern>/solr/*</url-pattern>
</servlet-mapping>
OpenCms performs a permission check for all resulting documents, discards those that the current user is not allowed to retrieve, and expands the result with the next best matching documents on the fly. This security check is very cost-intensive and should be replaced or improved with a purely index-based permission check.
OpenCms offers the capability to post-process Solr documents after they have been checked for permissions. This capability allows you to add fields to the found document before the search result is returned. In order to make use of the post processor, you have to add an optional parameter to the search index as follows:
<index class="org.opencms.search.solr.CmsSolrIndex">
<name>Solr Offline</name>
<rebuild>offline</rebuild>
<project>Offline</project>
<locale>all</locale>
<configuration>solr_fields</configuration>
<sources>
[...]
</sources>
<param name="search.solr.postProcessor">
my.package.MyPostProcessor
</param>
</index>
The class specified for the parameter search.solr.postProcessor must be an implementation of org.opencms.search.solr.I_CmsSolrPostSearchProcessor.
There is a default strategy implemented for the multi-language support within the OpenCms Solr search index. For binary documents the language is determined automatically based on the extracted text. The default mechanism is implemented with: http://code.google.com/p/language-detection/.
For XML contents the concrete language/locale information is available, and the names of localized fields end with an underscore followed by the locale, e.g. content_en, content_de or text_en, text_de. By default, all field mappings defined within the XSD of a resource type are extended by the suffix _<locale>.
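The locale suffix convention can be sketched in a few lines of Java. The class LocalizedFieldSketch and its methods are hypothetical helpers for building field names and simple term queries; they are not part of the OpenCms API.

```java
import java.util.Locale;

/** Hypothetical helper for the <field>_<locale> naming convention. */
final class LocalizedFieldSketch {

    /** Returns the localized field name, e.g. content_en for "content" and English. */
    static String localizedField(String baseName, Locale locale) {
        return baseName + "_" + locale.toString();
    }

    /** Builds a simple term query against the localized field. */
    static String termQuery(String baseName, Locale locale, String term) {
        return localizedField(baseName, locale) + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(localizedField("content", Locale.ENGLISH));
        System.out.println(termQuery("text", Locale.GERMAN, "Blume"));
    }
}
```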