The Google App Engine (GAE) Search API provides a simple interface for indexing and searching structured data.
For the AppScale implementation we use SOLR, an open source search platform whose query syntax is very similar to that of the Search API. In this post we show the high-level architecture and the components used to implement the GAE Search API.
An application indexes a document by providing a mapping of fields to values. Fields can have different types, such as text, numbers, geospatial locations, or dates. A document is a set of fields and values with a unique document identifier. This structure is handed to SOLR for indexing so that it can later be queried with a query language. The query language is nearly identical between SOLR and GAE, so we map it almost one-to-one, modifying it only slightly to achieve isolation between applications.
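To make the document model concrete, here is a minimal sketch in Python of a document as a unique identifier plus typed fields. The names `Document`, `FieldValue`, `GeoPoint`, and `field_type` are hypothetical and chosen for illustration; they are not the GAE or AppScale classes.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Tuple, Union

# Hypothetical value types mirroring the field types named in the text.
GeoPoint = Tuple[float, float]  # (latitude, longitude)
FieldValue = Union[str, float, date, GeoPoint]


@dataclass
class Document:
    """A unique identifier plus a mapping of field names to typed values."""
    doc_id: str
    fields: Dict[str, FieldValue] = field(default_factory=dict)


def field_type(value: FieldValue) -> str:
    """Classify a value as text, date, geo, or number."""
    if isinstance(value, str):
        return "text"
    if isinstance(value, date):
        return "date"
    if isinstance(value, tuple):
        return "geo"
    return "number"


doc = Document(
    doc_id="product-42",
    fields={
        "product": "laptop",
        "price": 899.0,
        "released": date(2014, 3, 1),
        "warehouse": (34.41, -119.85),
    },
)
types = {name: field_type(value) for name, value in doc.fields.items()}
```

A backend needs this kind of per-field type information to decide how each value should be indexed, for example so that a price can be compared numerically rather than as text.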
Applications can query their indexed documents using a simple query language. Queries can be as simple as a keyword or set of keywords, or more complicated, with boolean logic between keywords (AND/OR/NOT). You can also search for specific dates, or ranges of dates, contained in a document.
// search for documents that mention cookies
index.search("cookies")
// search for documents with laptops that cost less than $1000
index.search("product = laptop AND price < 1000")
The application servers take documents that need indexing and convert them into a serialized format (protocol buffers), which is then sent to a remote service (SearchServer). An index request consists of one or more documents and an index specification. This specification flags whether the content being searched for is in a specific field or the search is global across all fields. Application servers receive the response in protocol buffer format. The response to an indexing request is a success or error code for each document. For a query, the response contains the number of documents that matched, the documents themselves, and the query expressions they matched.
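The shape of an indexing round trip can be sketched as follows. This is only an illustration under stated assumptions: plain Python dataclasses stand in for the protocol buffer messages, and `IndexRequest`, `DocumentStatus`, and `handle_index_request` are hypothetical names, not the actual message or handler definitions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IndexRequest:
    """Stand-in for the serialized request: an index spec plus documents."""
    index_name: str  # hypothetical, simplified index specification
    documents: List[Dict[str, object]]


@dataclass
class DocumentStatus:
    """One success or error result per submitted document."""
    doc_id: str
    ok: bool
    error: str = ""


def handle_index_request(request: IndexRequest) -> List[DocumentStatus]:
    """Return one status per document, in the order they were submitted."""
    statuses = []
    for doc in request.documents:
        doc_id = doc.get("id")
        if not doc_id:
            # Assumed validation rule for this sketch only.
            statuses.append(DocumentStatus(doc_id="", ok=False, error="missing document id"))
        else:
            statuses.append(DocumentStatus(doc_id=str(doc_id), ok=True))
    return statuses


request = IndexRequest(
    index_name="products",
    documents=[{"id": "p1", "product": "laptop"}, {"product": "phone"}],
)
statuses = handle_index_request(request)
```

Returning a status per document, rather than one code for the whole request, lets the application see which documents in a batch failed.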
The search server follows the standard way AppScale provides services: it is a multithreaded Python web server that receives and sends protocol buffers. It maps the different types of requests from the application servers to SOLR operations. The main difficulty lies in translating the types and queries meant for GAE into SOLR-compatible operations. To form the SOLR equivalent of a field name, the field name is prepended with the application ID and then the namespace. This requires parsing the user query and replacing the field names while preserving the meaning given by the user. The application ID and namespace are included in the field names for multitenancy and isolation between applications running in AppScale. These additions are stripped from the field names before results are passed back to the application.
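The field-name rewriting can be illustrated with a small Python sketch. It is not the SearchServer code: the underscore separator, the prefix order, and the function names are assumptions, and a regular expression stands in for a real query parser, so it would wrongly rewrite operator-like text inside quoted values.

```python
import re

SEPARATOR = "_"  # hypothetical separator; the real naming scheme may differ

# A field reference: an identifier followed by a comparison operator.
FIELD_REF = re.compile(r"\b([A-Za-z][A-Za-z0-9_]*)(\s*(?:<=|>=|[:=<>]))")


def field_prefix(app_id: str, namespace: str) -> str:
    """Build the per-application, per-namespace prefix."""
    return f"{app_id}{SEPARATOR}{namespace}{SEPARATOR}"


def rewrite_query(query: str, app_id: str, namespace: str) -> str:
    """Prefix every field name; leave keywords, operators, and values alone."""
    prefix = field_prefix(app_id, namespace)
    return FIELD_REF.sub(lambda m: prefix + m.group(1) + m.group(2), query)


def strip_prefix(solr_field: str, app_id: str, namespace: str) -> str:
    """Recover the application's field name from the stored field name."""
    prefix = field_prefix(app_id, namespace)
    return solr_field[len(prefix):] if solr_field.startswith(prefix) else solr_field


rewritten = rewrite_query("product = laptop AND price < 1000", "guestbook", "ns1")
original_name = strip_prefix("guestbook_ns1_price", "guestbook", "ns1")
```

Because every stored field name carries the application ID and namespace, a query rewritten for one application cannot match fields indexed by another.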
AppScale’s implementation of the Search API is schema-free in the backend, and therefore does not require a search index YAML file. Just specify which node runs the “search” role in your AppScalefile and you can start using the Search API in AppScale 2.2.0 and higher.
master : 192.168.10.10
appengine : 192.168.10.11
database : 192.168.10.12
zookeeper : 192.168.10.13
search : 192.168.10.14