Tag Archive for databricks, databricks-sql

Slow Databricks query when filtering by one column and sorting by another

We have a huge table that stores information about blockchain blocks; of particular interest to us are the block number and its timestamp. Suppose we need to map a timestamp to a block number, that is, to solve the task “give me the highest block before or at a given timestamp”. This task translates into SQL:
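
A minimal sketch of a query with this shape, filtering on the timestamp and sorting on the block number. The table name `blocks`, the column names `block_number` and `block_timestamp`, and the timestamp literal are assumptions for illustration and do not come from the original question.

```sql
-- Hypothetical names: table `blocks`, columns `block_number`, `block_timestamp`.
-- Highest block at or before the given timestamp.
SELECT block_number
FROM blocks
WHERE block_timestamp <= TIMESTAMP '2024-01-01 00:00:00'  -- the given timestamp (example value)
ORDER BY block_number DESC                                -- highest block first
LIMIT 1;

-- The same lookup written as an aggregate. It returns a single NULL row,
-- rather than no rows, when no block matches the filter.
SELECT MAX(block_number) AS block_number
FROM blocks
WHERE block_timestamp <= TIMESTAMP '2024-01-01 00:00:00';
```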

Databricks Notebook (SQL) – Set Time Zone

I am trying to deal with an issue in Databricks where the logging framework we have put in place logs the wrong time (using GETDATE() or CURRENT_TIMESTAMP()). We attempted to resolve this with ‘SET TIME ZONE’, but it appears to be ignored in this SQL statement.
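
A minimal sketch for checking whether the setting is in effect. `SET TIME ZONE` applies to the current session, so the check has to run in the same session as the logging statement. The region ID is only an example value, and the column aliases are hypothetical.

```sql
-- Set the session time zone (example region ID, replace with the one you need).
SET TIME ZONE 'America/New_York';

-- Confirm which time zone the session is using and what it reports as "now".
SELECT current_timezone()  AS session_time_zone,
       current_timestamp() AS logged_at;
```

If `session_time_zone` does not show the value that was just set, the logging statement is probably running in a different session from the one where `SET TIME ZONE` was issued.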

Databricks: managed tables vs. external tables

Managed tables are fully managed by the Databricks workspace: Databricks handles both the storage and the metadata of the table, including the lifecycle of the data. When you create or delete a managed table, Databricks automatically manages the underlying data files.
External tables, on the other hand, store their data outside of Databricks-managed storage, typically in cloud storage that you control (e.g., AWS S3 or Azure Blob Storage). While Databricks manages the metadata for external tables, the actual data remains in the specified external location, which gives you flexibility and control over the data storage lifecycle. This setup allows users to leverage existing data storage infrastructure while utilizing Databricks’ processing capabilities.
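
A minimal sketch of the difference in DDL: a table created without a `LOCATION` clause is managed, while one created with `LOCATION` is external. The table names, columns, and the storage path below are hypothetical placeholders.

```sql
-- Managed table: no LOCATION clause, Databricks decides where the files live.
CREATE TABLE sales_managed (
  id     INT,
  amount DECIMAL(10, 2)
);

-- External table: LOCATION points at storage you manage yourself
-- (placeholder path, not a real bucket).
CREATE TABLE sales_external (
  id     INT,
  amount DECIMAL(10, 2)
)
LOCATION 's3://example-bucket/path/to/sales';

-- The detailed description includes the table type (MANAGED or EXTERNAL)
-- and its storage location.
DESCRIBE TABLE EXTENDED sales_external;
```

Dropping `sales_external` would remove only its metadata and leave the files at the external location, whereas dropping `sales_managed` hands the cleanup of the data files to Databricks.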

How to parse CSV using from_csv with schema_of_csv?

I want to parse CSV data that is stored as the value of a column in another table. The problem is that I don’t know the schema of this CSV data.
Currently, I am trying a query like this: SELECT from_csv(csv_col, schema_of_csv(csv_col)) AS csv_data FROM csv_data_table; but it looks like schema_of_csv does not accept column names.
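
A minimal sketch of one possible workaround, assuming every row shares the same layout: `schema_of_csv` expects a constant string, so the schema is inferred from a literal sample row and then applied to the column. The sample values and the inline `VALUES` rows are made up for illustration; `csv_data_table` and `csv_col` are the names from the question.

```sql
-- Step 1: infer a schema from a literal sample row (not from a column).
SELECT schema_of_csv('1,abc,2.5') AS inferred_schema;

-- Step 2: apply a constant schema to the column. The schema argument can be
-- a call to schema_of_csv on a literal...
SELECT from_csv(csv_col, schema_of_csv('1,abc,2.5')) AS csv_data
FROM VALUES ('1,abc,2.5'), ('2,def,3.75') AS csv_data_table(csv_col);

-- ...or an explicit DDL-style schema string written by hand.
SELECT from_csv(csv_col, 'id INT, name STRING, score DOUBLE') AS csv_data
FROM VALUES ('1,abc,2.5'), ('2,def,3.75') AS csv_data_table(csv_col);
```

This does not cover the case where the layout differs from row to row, because the schema still has to be fixed when the query is written.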