A Solr schema defines the relationship between data in a column family and a Solr core. The schema identifies the columns to index in Solr and maps column names to Solr types. This document describes the Solr schema at a high level. For details about all the options and Solr schema settings, see the Solr wiki.
Wikipedia Sample Schema Elements
The sample schema.xml for the Wikipedia demo represents a typical schema. It specifies a tokenizer that determines how the wiki text is parsed. The set of fields specifies what Solr indexes and stores. In this example, the id, name, body, title, and date fields are indexed and stored.
<schema name="wikipedia" version="1.1">
<types>
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text" class="solr.TextField">
<analyzer><tokenizer class="solr.WikipediaTokenizerFactory"/></analyzer>
</fieldType>
</types>
<fields>
<field name="id" type="string" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="body" type="text" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="date" type="string" indexed="true" stored="true"/>
</fields>
<defaultSearchField>body</defaultSearchField>
<uniqueKey>id</uniqueKey>
</schema>
The example schema.xml meets the requirement to have a unique key and no duplicate rows. The unique key maps to the row key and is necessary for DSE to route documents to cluster nodes. This unique key is like a primary key in SQL. The uniqueKey element at the end of the schema.xml example designates id as the unique key. In a DSE Search/Solr schema, the value of the stored attribute of non-unique fields needs to be true. A value of true causes the field to be stored in Cassandra. The field does not show up in search results.
Changing the Solr schema requires reloading the Solr core. Re-indexing can be disruptive, and users can experience performance hits while it runs. Change the schema only when absolutely necessary, and preferably during scheduled downtime.
After indexing the Wikipedia articles, Cassandra columns in the column family contain metadata corresponding to the fields listed in the demo schema. The output of the CLI command, DESCRIBE wiki, shows this metadata:
Column Name: body
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Index Name: wiki_solr_body_index
Index Type: CUSTOM
Index Options: {class_name=com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex}
Column Name: date
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Index Name: wiki_solr_date_index
Index Type: CUSTOM
Index Options: {class_name=com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex}
Column Name: name
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Index Name: wiki_solr_name_index
Index Type: CUSTOM
Index Options: {class_name=com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex}
Column Name: solr_query
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Index Name: wiki_solr_solr_query_index
Index Type: CUSTOM
Index Options: {class_name=com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex}
Column Name: title
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Index Name: wiki_solr_title_index
Index Type: CUSTOM
Index Options: {class_name=com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex}
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Column metadata matches each field in the schema except the id field because id is the unique key. The column metadata example shows some of the Cassandra Validator types in the Validation Class attribute.
DataStax Enterprise 3.0 and earlier releases use legacy mapping of Solr types to Cassandra validator types. In DataStax Enterprise 3.0.1 and later, this mapping is used:
| Solr Type | Cassandra Validator | Description |
|---|---|---|
| BCDIntField | Int32Type | Binary-coded decimal (BCD) integer. BCD is a relatively inefficient encoding that offers the benefits of quick decimal calculations and quick conversion to a string. |
| BCDLongField | LongType | BCD long integer |
| BCDStrField | UTF8Type | BCD string |
| BinaryField | BytesType | Binary data |
| BoolField | BooleanType | Contains either true or false. Values of "1", "t", or "T" in the first character are interpreted as true. Any other values in the first character are interpreted as false. |
| ByteField | Int32Type | Contains an 8-bit number value. |
| DateField | DateType | Represents a point in time with millisecond precision. |
| DoubleField | DoubleType | Double (64-bit IEEE floating point) |
| ExternalFileField | UTF8Type | Pulls values from a file on disk. See the section below on working with external files. |
| FloatField | FloatType | Floating point (32-bit IEEE floating point) |
| IntField | Int32Type | Integer (32-bit signed integer) |
| LongField | LongType | Long integer (64-bit signed integer) |
| RandomSortField | UTF8Type | Does not contain a value. Queries that sort on this field type will return results in random order. Use a dynamic field to use this feature. |
| ShortField | Int32Type | Short integer |
| SortableDoubleField | DoubleType | The Sortable* fields provide correct numeric sorting. If you use the plain types (DoubleField, IntField, and so on) sorting will be lexicographical instead of numeric. |
| SortableFloatField | FloatType | Numerically sorted floating point |
| SortableIntField | Int32Type | Numerically sorted integer |
| SortableLongField | LongType | Numerically sorted long integer |
| StrField | UTF8Type | String (UTF-8 encoded string or Unicode) |
| TextField | UTF8Type | Text, usually multiple words or tokens |
| TrieDateField | DateType | Date field accessible for Lucene TrieRange processing |
| TrieDoubleField | DoubleType | Double field accessible for Lucene TrieRange processing |
| TrieField | see description | Used with a type attribute and value: integer, long, float, double, date. Same as using any of the Trie field types, such as TrieIntField. |
| TrieFloatField | FloatType | Floating point field accessible for Lucene TrieRange processing |
| TrieIntField | Int32Type | Int field accessible for Lucene TrieRange processing |
| TrieLongField | LongType | Long field accessible for Lucene TrieRange processing |
| UUIDField | UUIDType | Universally Unique Identifier (UUID). Pass a value of NEW and Solr creates a new UUID. |
| LatLonType | UTF8Type | Latitude/Longitude as a 2 dimensional point. Latitude is always specified first. |
| PointType | UTF8Type | For spatial search: An arbitrary n-dimensional point, useful for searching sources such as blueprints or CAD drawings. |
| GeoHashField | UTF8Type | Represents a Geohash. The field is provided as a lat/lon pair and is internally represented as a string. |
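As an illustration of the table above, the following sketch declares a few field types and notes in comments the Cassandra validator each one would map to under the newer mapping. The schema name, type names, and field names (events, created, active, session) are hypothetical and chosen only for this example.

```xml
<!-- Hypothetical schema: all names here are illustrative, not from the Wikipedia demo. -->
<schema name="events" version="1.1">
  <types>
    <fieldType name="string" class="solr.StrField"/>      <!-- maps to UTF8Type -->
    <fieldType name="date" class="solr.TrieDateField"/>   <!-- maps to DateType -->
    <fieldType name="boolean" class="solr.BoolField"/>    <!-- maps to BooleanType -->
    <fieldType name="uuid" class="solr.UUIDField"/>       <!-- maps to UUIDType -->
  </types>
  <fields>
    <!-- The unique key field, which corresponds to the Cassandra row key. -->
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="created" type="date" indexed="true" stored="true"/>
    <field name="active" type="boolean" indexed="true" stored="true"/>
    <field name="session" type="uuid" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
```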
In DataStax Enterprise 3.0 and earlier, Solr types map to these Cassandra validator types:
| Solr Type | Cassandra Validator |
|---|---|
| TextField | UTF8Type |
| StrField | UTF8Type |
| LongField | LongType |
| IntField | Int32Type |
| FloatField | FloatType |
| DoubleField | DoubleType |
| DateField | UTF8Type |
| ByteField | BytesType |
| BinaryField | BytesType |
| BoolField | UTF8Type |
| UUIDField | UUIDType |
| TrieDateField | UTF8Type |
| TrieDoubleField | UTF8Type |
| TrieField | UTF8Type |
| TrieFloatField | UTF8Type |
| TrieIntField | UTF8Type |
| TrieLongField | UTF8Type |
| All Others | UTF8Type |
For efficiency in operations such as range queries, using Trie types is recommended.
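A minimal sketch of a Trie type declaration follows, based on standard Solr schema conventions rather than on the Wikipedia demo. The type name tint and the field name year are hypothetical. The fragment belongs inside the schema element of schema.xml.

```xml
<types>
  <!-- precisionStep="8" indexes each value at several precisions,
       trading a larger index for faster range queries. -->
  <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
             omitNorms="true" positionIncrementGap="0"/>
</types>
<fields>
  <!-- Hypothetical numeric field that uses the Trie type. -->
  <field name="year" type="tint" indexed="true" stored="true"/>
</fields>
<!-- A range query against this field could then look like:
     q=year:[2000 TO 2010] -->
```

A smaller precisionStep generally speeds up range queries at the cost of index size, while precisionStep="0" indexes a single term per value.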
By default, DataStax Enterprise 3.0.x enables legacy type mapping (dseTypeMappingVersion is set to 0).
To make the new Solr type mappings effective, add the following line to solrconfig.xml:
<dseTypeMappingVersion>1</dseTypeMappingVersion>
Switching between the two versions is not supported. Attempting to load a solrconfig with a different dseTypeMappingVersion configuration and reloading the core causes an error.
Although the examples in solrconfig.xml suggest that relative paths are supported, DataStax Enterprise does not support relative path values for the <lib> property. DSE Search/Solr fails to find files placed in directories defined by the <lib> property. The workaround is to place custom code or Solr contrib modules in these directories: