Postgres tsvector data type
Optimize full-text search in Postgres with the tsvector data type
tsvector
is a specialized Postgres data type designed for full-text search operations. It represents a document in a form optimized for text search, where each word is reduced to its root form (lexeme) and stored with information about its position and importance.
In Postgres, the tsvector
data type is useful for implementing efficient full-text search capabilities, allowing for fast and flexible searching across large volumes of text data.
Storage and syntax
A tsvector
value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word. Each lexeme can be followed by position(s) and/or weight(s).
The general syntax for a tsvector
is:
Where:
word1
,word2
, etc., are the lexemes1
,3
, etc. are integers indicating the position of the word in the document- positions can sometimes be followed by a letter to indicate a weight ('A', 'B', 'C' or 'D'), like
2A
. The default weight is 'D'.
For example:
'a':1A 'cat':2 'sat':3 'on':4 'the':5 'mat':6
When a document is cast to tsvector
, it doesn't perform any normalization and just splits the text into lexemes. To normalize the text, you can use the to_tsvector
function with a specific text search configuration. For example:
This query produces the following output. The function to_tsvector()
tokenizes the input document and computes the normalized lexemes based on the specified text search configuration (in this case, 'english'). The output is a tsvector
with the normalized lexemes and their positions.
Example usage
Consider a scenario where we're building a blog platform and want to implement full-text search for articles. We'll use tsvector
to store the searchable content of each article.
The query below creates a table and inserts some sample blog data:
To search for blog posts containing specific words, we can use the match operator @@
, with a tsquery
search expression:
This query returns the following output:
Other examples
tsvector
Use different text search configurations with Postgres supports text search configurations for multiple languages. Here's an example using the 'spanish' configuration:
This query returns the following output:
tsvector
column
Rank the search results from a We can use the ts_rank
function to rank search results based on relevance:
This query returns the following output:
All the articles were related to climate change, but the first article was ranked higher due to the higher relevance for the search terms.
Additional considerations
- Performance: While
tsvector
enables fast full-text search, creating and updatingtsvector
columns can be computationally expensive. Consider using triggers or background jobs to updatetsvector
columns asynchronously. - Storage:
tsvector
columns can significantly increase the size of your database. Monitor your database size and consider using partial indexes if full-text search is only needed for a subset of your data. - Language support: PostgreSQL supports many languages out of the box, but you may need to install additional dictionaries for some languages.
- Stemming and stop words: The text search configuration determines how words are stemmed and which words are ignored as stop words. Choose the appropriate configuration for your use case.
Resources
Need help?
Join our Discord Server to ask questions or see what others are doing with Neon. Users on paid plans can open a support ticket from the console. For more details, see Getting Support.