SpectX executes structured queries directly on raw data files. To do so, it needs to know which files to read. Before the query engine can work with the data, the retrieved raw bytes must be interpreted according to the expected structure (the pattern) and transformed into structured records. Hence a query is logically split into two phases: a preparatory phase, which locates the source files, retrieves their raw bytes and parses them into a stream of records, and a query phase, which operates on that record stream.
The preparatory phase may vary depending on the nature of the data. When data is acquired from structured files, such as AVRO or ROSBAG, pattern-based parsing is not necessary and the data is converted directly to a structured record stream. When querying relational databases, SQL statements are executed on user-specified tables or views instead of listing source files. In every case, however, the query phase operates on the record stream produced by the preparatory phase.
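The split between the two phases can be illustrated with a minimal Python sketch. It is not SpectX code: the helper names (`list_sources`, `read_raw`, `parse`, `prepare`, `query`), the log layout and the regular expression standing in for a pattern are all assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: "<ip> <status> <bytes>" on each line.
# A regular expression stands in for a real pattern definition here.
LINE_PATTERN = re.compile(r"(?P<ip>\S+) (?P<status>\d{3}) (?P<bytes>\d+)")


def list_sources(root, glob):
    """Preparatory step 1: decide which files to read."""
    return sorted(Path(root).glob(glob))


def read_raw(paths):
    """Preparatory step 2: retrieve the raw bytes of each file."""
    for path in paths:
        yield path.read_bytes()


def parse(blobs, pattern=LINE_PATTERN):
    """Preparatory step 3: interpret raw bytes as structured records."""
    for blob in blobs:
        for line in blob.decode("utf-8").splitlines():
            match = pattern.fullmatch(line)
            if match:  # lines that do not match the pattern are skipped
                yield {
                    "ip": match["ip"],
                    "status": int(match["status"]),
                    "bytes": int(match["bytes"]),
                }


def prepare(root, glob):
    """The whole preparatory phase: files in, record stream out."""
    return parse(read_raw(list_sources(root, glob)))


def query(records):
    """The query phase sees only records, never files or bytes."""
    return [r for r in records if r["status"] >= 500]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.log").write_text("10.0.0.1 200 512\n10.0.0.2 503 0\n")
        Path(tmp, "b.log").write_text("not a log line\n10.0.0.3 500 17\n")
        return query(prepare(tmp, "*.log"))
```

Because `query` receives only a record stream, the preparatory phase could be swapped for a reader of structured files or a database cursor without changing the query phase.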
Internally, the query is processed as a sequence of stages. Consider a statement of piped SXQL commands, where each command takes its input from its predecessor and passes its output to the next in line. The internal stages logically correspond to the same sequence (a statement written in SQL style is decomposed into a similar sequence of stages). Stages often change the fields in the stream, for example by adding new fields computed by functions, or by joining or unioning other streams. They may change the number of records in the stream by applying a filter or where clause. Or they may make no change at all (for instance, when SpectX is used as an ETL engine). The data passed between stages is always in the form of a record stream. At the end, the output can be displayed on the screen, stored as a result table within SpectX or in a relational database, or exported in various formats (CSV, JSON, etc.).
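The idea of stages passing a record stream from one to the next can be sketched in Python with generators. This is an illustrative analogy only, not the SXQL syntax or SpectX's internal implementation; `add_field`, `where`, `passthrough` and `pipeline` are hypothetical names.

```python
from functools import reduce


def add_field(name, fn):
    """Stage that changes the fields: adds one computed field per record."""
    def stage(records):
        for record in records:
            yield {**record, name: fn(record)}
    return stage


def where(predicate):
    """Stage that changes the number of records: keeps only matching ones."""
    def stage(records):
        return (record for record in records if predicate(record))
    return stage


def passthrough(records):
    """Stage that applies no change at all."""
    return records


def pipeline(source, *stages):
    """Feed the output stream of each stage into the next one in line."""
    return reduce(lambda stream, stage: stage(stream), stages, source)


source = [
    {"host": "a", "bytes": 2048},
    {"host": "b", "bytes": 100},
    {"host": "c", "bytes": 4096},
]

result = list(
    pipeline(
        source,
        add_field("kb", lambda r: r["bytes"] / 1024),
        where(lambda r: r["kb"] >= 1),
        passthrough,
    )
)
```

Every stage takes an iterable of records and returns one, so stages can be reordered or dropped without affecting the others.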
The actual processing of a query takes place in a distributed manner. The stages are decomposed into a set of smaller tasks that are executed in parallel. The exact sequence and number of tasks is determined by the query optimiser, depending on the predicates and the nature of the query (aggregation, windows, etc.).
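As a rough illustration of decomposing one stage into parallel tasks, the Python sketch below splits a count-by-status aggregation into per-chunk partial counts and then merges them. The chunking scheme, the thread pool and the function names are assumptions; the text does not describe how the SpectX optimiser actually plans tasks.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def chunked(records, size):
    """Split the input into fixed-size chunks, one per task."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def partial_count(chunk):
    """One small task: aggregate a single chunk independently of the others."""
    return Counter(record["status"] for record in chunk)


def parallel_count_by_status(records, chunk_size=2, workers=4):
    """Run the partial aggregations in parallel, then merge the results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(partial_count, chunked(records, chunk_size)):
            total.update(partial)
    return dict(total)


records = [{"status": s} for s in (200, 200, 500, 404, 200, 500)]
counts = parallel_count_by_status(records)
```

Counting works here because partial counts can be merged by simple addition; other kinds of query, such as windows, would need a different decomposition.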
SpectX processes data as a snapshot. With each query execution, data is read from the specified input resources. When caching is enabled, only the missing delta of the data is transferred.
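One way to picture delta transfer is a cache that remembers what it has already fetched and requests only the missing part. The Python sketch below assumes append-only sources and simulates the remote store with a dictionary. The `DeltaCache` class is hypothetical and says nothing about how SpectX implements caching.

```python
class DeltaCache:
    """Toy cache that transfers only bytes it has not seen yet.

    Assumes sources are append-only, so the missing delta is always the tail.
    """

    def __init__(self):
        self._cached = {}      # resource name -> bytes fetched so far
        self.transferred = 0   # total number of bytes "sent over the wire"

    def read(self, name, remote):
        have = self._cached.get(name, b"")
        delta = remote[name][len(have):]   # only the missing tail is fetched
        self.transferred += len(delta)
        self._cached[name] = have + delta
        return self._cached[name]


def demo():
    remote = {"app.log": b"line1\n"}
    cache = DeltaCache()

    first = cache.read("app.log", remote)    # cold cache: everything is fetched
    after_first = cache.transferred

    remote["app.log"] += b"line2\n"          # the source grows between queries
    second = cache.read("app.log", remote)   # warm cache: only the new tail
    after_second = cache.transferred

    return first, after_first, second, after_second
```

Each call to `read` still returns the full current content, which matches the snapshot behaviour described above; only the amount of data transferred differs.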
SpectX queries are written as scripts. This allows complex analysis tasks to be composed of many queries that manipulate the retrieved data, all in an easily readable form accompanied by explanatory comments.
Additionally, SpectX allows Views to be defined to capture input resources and data format definitions. Views are also useful for hiding the location and format of data from analytics, i.e. separating the roles of data resource management and analytics.
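The separation of roles that Views provide can be sketched in Python as a named object that bundles the data location and its format. The analytics code then refers only to the name. The `View` class, the `VIEWS` registry and the sample data are hypothetical, and in-memory lines stand in for a real data location.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """Bundles where the data lives and how to parse it."""
    lines: tuple          # stand-in for a data location
    pattern: re.Pattern   # stand-in for a data format definition

    def records(self):
        for line in self.lines:
            match = self.pattern.fullmatch(line)
            if match:
                yield match.groupdict()


# Maintained by whoever manages the data resources.
VIEWS = {
    "web_access": View(
        lines=("alice GET /", "bob POST /login"),
        pattern=re.compile(r"(?P<user>\S+) (?P<method>\S+) (?P<path>\S+)"),
    ),
}


def users_posting(view_name):
    """Analytics code: knows the view name and field names, nothing else."""
    return [r["user"] for r in VIEWS[view_name].records() if r["method"] == "POST"]
```

If the data moves or its format changes, only the entry in `VIEWS` needs updating; `users_posting` stays as it is.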