These are reading notes on "Solr in Action", covering Chapter 3.
Chapter 03. Key Solr Concepts
A document is a collection of fields that map to particular field types defined in a schema.
Solr is a document storage and retrieval engine. Every piece of data submitted to Solr for processing is a document.
An inverted index maps each word/term in the corpus to all of the documents in which it appears.
A search engine has to determine textually similar words, understand synonyms, remove unimportant words, score each result… Solr accomplishes all of this by using an index that maps content to documents instead of mapping documents to content as in a traditional database model.
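As a rough illustration of the idea of mapping content to documents, the sketch below builds a toy inverted index in Python. The function name `build_inverted_index`, the sample documents, and the whitespace tokenization are assumptions made for this example; Solr's real analysis chain and index format are far more involved.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization
            index[term].add(doc_id)
    return index

# Hypothetical sample corpus: document id -> text.
docs = {
    1: "new home sales",
    2: "new house for rental",
    3: "house and home",
}

index = build_inverted_index(docs)
print(sorted(index["new"]))    # [1, 2]
print(sorted(index["house"]))  # [2, 3]
```

Looking up a term is then a single dictionary access, whereas a document-to-content layout would require scanning every document.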
Required Terms: new AND house; +new +house
Optional Terms: new house; new OR house. (default)
Negated Terms: new house NOT rental; new house -rental
Phrases: "new home" OR "new house"
Grouped expressions: New AND (house OR home)
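One way to picture how these Boolean operators behave is as set operations over the document ids stored for each term. The postings below are made-up data for illustration, and this is only a sketch of the logic, not how Solr evaluates queries internally.

```python
# Hypothetical postings: term -> set of ids of documents containing the term.
postings = {
    "new": {1, 2, 4},
    "house": {2, 3, 4},
    "home": {1, 3},
    "rental": {4},
}

# Required terms: new AND house (set intersection).
required = postings["new"] & postings["house"]

# Optional terms: new OR house (set union).
optional = postings["new"] | postings["house"]

# Negated terms: new house -rental (union minus the excluded postings).
negated = optional - postings["rental"]

# Grouped expression: new AND (house OR home).
grouped = postings["new"] & (postings["house"] | postings["home"])

print(sorted(required))  # [2, 4]
print(sorted(optional))  # [1, 2, 3, 4]
print(sorted(negated))   # [1, 2, 3]
print(sorted(grouped))   # [1, 2, 4]
```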
Term positions: the index records the relative position of terms within a document, which tells us where in the document each term appears.
Fuzzy matching is defined as the ability to perform inexact matches on terms in the search index. For example, a wildcard search finds any words that start with a particular prefix.
- the more characters you specify at the beginning of the term before the wildcard, the faster the query should run.
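To get an intuition for why a longer leading prefix helps, the sketch below looks up a prefix in a sorted list of terms. The sorted-list layout, the `prefix_matches` helper, and the sample terms are assumptions for illustration and do not reflect Lucene's actual term dictionary; the point is only that a longer prefix narrows the range of candidate terms that has to be examined.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Return all terms starting with prefix, scanning from the first candidate."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches

# Hypothetical term dictionary, kept in sorted order.
terms = sorted(["home", "hotel", "house", "housing", "new", "rental"])

print(prefix_matches(terms, "ho"))    # ['home', 'hotel', 'house', 'housing']
print(prefix_matches(terms, "hous"))  # ['house', 'housing']
```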
Fuzzy (edit-distance) searching finds spelling variations within one or two characters (to handle spelling errors).
Solr provides the ability to handle character variations using edit-distance measurements based upon Damerau-Levenshtein distances, which account for more than 80% of all human misspellings.
An edit distance is defined as an insertion, a deletion, a substitution, or a transposition of characters.
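The four edit operations listed above can be sketched with a small dynamic-programming routine. The version below is the "optimal string alignment" variant of Damerau-Levenshtein distance, chosen here for brevity; it is an illustration of the measure, not Solr's implementation, and the function name `osa_distance` is an assumption.

```python
def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(osa_distance("house", "huose"))  # 1 (transposition)
print(osa_distance("house", "horse"))  # 1 (substitution)
print(osa_distance("home", "house"))   # 2 (substitution + insertion)
```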
Proximity searching matches two terms within some maximum distance of each other. It uses term positions to compute the edit distance.
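A simplified way to see how term positions enable proximity matching is to record each term's positions and check the gap between two terms. The helpers `positions` and `within_distance` are hypothetical, and the gap check below is a deliberate simplification rather than Solr's exact distance computation.

```python
def positions(text):
    """Record the positions at which each term occurs in a document."""
    index = {}
    for pos, term in enumerate(text.lower().split()):
        index.setdefault(term, []).append(pos)
    return index

def within_distance(text, term_a, term_b, max_gap):
    """True if term_a and term_b occur with at most max_gap terms between them."""
    pos = positions(text)
    return any(
        abs(pa - pb) - 1 <= max_gap
        for pa in pos.get(term_a, [])
        for pb in pos.get(term_b, [])
    )

doc = "chief executive officer"
print(within_distance(doc, "chief", "officer", 1))  # True: one term in between
print(within_distance(doc, "chief", "officer", 0))  # False: not adjacent
```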
Solr calculates a relevancy score for each document and then sorts the search results from the highest score to the lowest.
- Solr’s relevancy scores are based upon the Similarity class, which can be defined on a per-field basis in Solr’s schema. Similarity is a Java class that defines how a relevancy score is calculated based upon the results of a query.
- First, it makes use of a Boolean model to filter out any documents that do not match the customer’s query.
- Then it uses a vector space model for scoring and drawing the query as a vector, as well as an additional vector for each document.
- The similarity score for each document is based upon the cosine between the query vector and that document’s vector.
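The cosine between a query vector and a document vector can be sketched as follows. This example uses raw term counts as vector weights and whitespace tokenization, both of which are simplifying assumptions; the function name `cosine_similarity` is likewise just for illustration.

```python
import math
from collections import Counter

def cosine_similarity(query, document):
    """Cosine of the angle between two bag-of-words term-count vectors."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[term] * d[term] for term in q)  # missing terms count as 0
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

print(round(cosine_similarity("new house", "new house"), 3))   # 1.0
print(round(cosine_similarity("new house", "new home"), 3))    # 0.5
print(round(cosine_similarity("new house", "rental car"), 3))  # 0.0
```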
- term frequency (tf)
- inverse document frequency (idf)
- term boosts
- field normalization
- coordination factor
- query normalization (queryNorm)
Term frequency (tf) is a measure of how often a particular term appears in a matching document.
- The more times the search term appears within a document, the more relevant that document is considered.
Inverse Document Frequency
Inverse document frequency (idf), a measure of how “rare” a search term is, is calculated by finding the document frequency (how many total documents the search term appears within), and calculating its inverse.
Term frequency and inverse document frequency, when multiplied together in the relevancy calculation, provide a nice counterbalance:
- term frequency elevates terms that appear multiple times within a document,
- whereas inverse document frequency penalizes those terms that appear commonly across many documents.
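The counterbalance between term frequency and inverse document frequency can be seen in a small worked example. The formulas below (`sqrt` of the term count, and `1 + ln(numDocs / (docFreq + 1))`) follow what Lucene's classic TF-IDF similarity is commonly documented to use, but treat them and the sample corpus as assumptions of this sketch rather than a statement of Solr's exact scoring.

```python
import math

def tf(term_count):
    """Term frequency factor: grows with the count, but sub-linearly."""
    return math.sqrt(term_count)

def idf(num_docs, doc_freq):
    """Inverse document frequency factor: rarer terms score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def tf_idf(term, doc, corpus):
    """Product of tf and idf for one term in one document of the corpus."""
    count = doc.split().count(term)
    doc_freq = sum(1 for d in corpus if term in d.split())
    return tf(count) * idf(len(corpus), doc_freq)

# Hypothetical corpus: "the" is common everywhere, "house" is rare.
corpus = [
    "the new house is the best house",
    "the new home",
    "the rental home",
    "the cat",
]

# Both terms appear twice in the first document, yet the rare term wins.
print(tf_idf("house", corpus[0], corpus) > tf_idf("the", corpus[0], corpus))  # True
```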
If you have domain knowledge about your content—you know that certain fields or terms are more (or less) important than others—you can supply boosts at either indexing time or query time to ensure that the weights of those fields or terms are adjusted accordingly.
- Query-time boosting
- Index-time boosting
it’s possible to boost documents or fields within documents at index time
The default Solr relevancy formula calculates three kinds of normalization factors (norms): field norms, query norms, and the coord factor.
The field normalization factor (field norm) is a combination of factors describing the importance of a particular field on a per-document basis.
The field norm is stored as a single byte per field, and this byte packs a lot of information:
- the boost set on the document when indexed,
- the boost set on the field when indexed,
- and a length normalization factor that penalizes longer documents and helps shorter documents
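The three ingredients listed above can be combined in a short sketch. The formula used here (the two boosts multiplied by `1 / sqrt(number of terms)`) is an assumption modeled on Lucene's classic similarity, and the lossy encoding of the result into a single byte is deliberately left out.

```python
import math

def field_norm(num_terms, doc_boost=1.0, field_boost=1.0):
    """Combine index-time boosts with a length normalization factor.

    Assumed formula: doc_boost * field_boost * 1 / sqrt(num_terms).
    """
    length_norm = 1.0 / math.sqrt(num_terms) if num_terms > 0 else 0.0
    return doc_boost * field_boost * length_norm

print(field_norm(4))                   # 0.5
print(field_norm(16))                  # 0.25 (longer field, smaller norm)
print(field_norm(4, field_boost=2.0))  # 1.0 (index-time boost raises the norm)
```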
Query Norm does not affect the overall relevancy ordering, as the same queryNorm is applied to all documents.
It merely serves as a normalization factor to attempt to make scores between queries comparable.
The Coord Factor
Its role is to measure how much of the query each document matches.
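A minimal sketch of this idea, assuming the coord factor is simply the fraction of distinct query terms found in the document (the function name `coord` and the sample data are made up for illustration):

```python
def coord(query_terms, doc_terms):
    """Fraction of distinct query terms that appear in the document."""
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)

query = ["new", "house", "rental"]
print(coord(query, "new house rental listing".split()))  # 1.0
print(coord(query, "rental car".split()))                # 0.3333333333333333
```

A document matching all three query terms gets the full factor, while one matching a single term is scaled down to a third.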
Precision and Recall
- Precision is a measure of how “good” each of the results of a query is,
- but it pays no attention to how thorough it is; *a query that returns one single correct document out of a million other correct documents is still considered perfectly precise.*
- Precision is answering the question: “Were the documents that came back the ones I was looking for?”
- Recall is a measure of how thorough the search results are.
- Recall is answering the question: “How many of the correct documents were returned?”
- Precision is high when the results returned are correct; Recall is high when the correct results are present.
- Recall does not care that all of the results are correct. Precision does not care that all of the results are present.
- measuring for Recall across the entire result set;
- measuring for Precision only within the first page (or few pages) of search results.
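The two measures can be computed directly from the set of returned documents and the set of correct documents. The helper name `precision_recall` and the document ids are assumptions for this example.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result set against the correct set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = [1, 2, 3, 4]

# One correct result only: perfectly precise, but not thorough.
print(precision_recall([1], relevant))  # (1.0, 0.25)

# Everything correct is present, along with four incorrect results.
print(precision_recall([1, 2, 3, 4, 5, 6, 7, 8], relevant))  # (0.5, 1.0)
```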
Searching at Scale
Solr is able to scale to handle billions of documents and an infinite number of queries by adding servers.
The denormalized document
A denormalized document is one in which all fields are self-contained within the document, even if the values in those fields are duplicated across many documents.
- Advantage: extreme scalability. Because we can assume that each document is self-contained, we can also partition documents across multiple servers without having to keep related documents on the same server (because documents are independent of one another).
- sometimes your search servers may become overloaded by either too many queries at a time or by too much data needing to be searched through for a single server to handle.
- Solution for the latter case: aggregated search.
- distributed search across multiple Solr cores is run in parallel on each of those index partitions
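The idea of searching index partitions in parallel and aggregating the results can be sketched in a few lines. Everything here is hypothetical: the in-memory "shards", the count-based scoring, and the function names stand in for real Solr cores and real relevancy scoring, and the shard list is assumed to be non-empty.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term):
    """Score each document in one partition by how often it contains the term."""
    results = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score > 0:
            results.append((score, doc_id))
    return results

def distributed_search(shards, term, rows=10):
    """Query every partition in parallel, then merge the best hits overall."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), shards)
        merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(rows, merged)  # highest score first

# Two hypothetical partitions of self-contained documents.
shards = [
    {1: "new house", 2: "rental car"},
    {3: "house house home", 4: "new home"},
]

print(distributed_search(shards, "house"))  # [(2, 3), (1, 1)]
```

Because each document is self-contained, no partition needs to consult another before returning its partial results; only the final merge sees all of them.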
Clusters vs. Servers
- the servers for this use case are mutually dependent. If one becomes unavailable for searching, they all become unavailable for searching and begin failing
- Solr provides excellent built-in cluster-management capabilities through the use of Apache ZooKeeper
- Solr is NOT relational in any way across documents. It’s not well suited for joining significant amounts of data across different fields on different documents, and it can’t perform join operations at all across multiple servers.
- The denormalized nature of Solr can be particularly problematic when the data in one field that is shared across many documents changes
you can insert, delete, and update documents, but not single fields (easily).
whenever a new field is added to Solr or the contents of an existing field have changed, every single document in the Solr index must be reprocessed in its entirety before the data will be populated for the new field in all documents.
Solr is not optimized for processing quite long queries (thousands of terms) or returning quite large result sets to users.
- elastic scalability: the ability to automatically add and remove servers and redistribute content to handle load.