we are doing the data validation to passing the source table for SQL view name like TEST_SCH.VIEWNAME and target datalake delta view for schema.deltaview to compare the column row count testing. so that we have to write many script to do execute.
I have common scripts which i have to master source and target table to pass the single scripts then my script should execute all the source and target views. both view we are passing common business key to sorting the final compared data to confirm there is no difference in the data. below scripts currently we are using but make it optimized and single script to pass list of source and target view name with business column details.
sample current scripts for us now.
%run "/SQLSERVER/TEST_ConnectionInfo"
table_name = "[TEST].[SQL_TABLE_VIEW]"
source = spark.read
.format("jdbc")
.option("url", jdbcUrl)
.option("dbtable", table_name)
.option("databaseName", database_name)
.option("accessToken", access_token)
.option("encrypt", "true")
.option("hostNameInCertificate", "*.database.windows.net")
.load()
source.createOrReplaceTempView("source_view")
df = spark.sql(f"""select * from target.deltatableviewname""")
df.createOrReplaceTempView("target")
Source=spark.sql('select * from source_view')
Target=spark.sql('select * from target')
first level of check source and target count should match our scripts
below one A-B and B-A concept with source – target data operation :
from pyspark.sql.functions import *
df_DataFromsrcDiff=Source.subtract(target).withColumn("Source", lit("A"))
df_DataFromtargetDiff=target.subtract(Source).withColumn("Source", lit("B"))
from pyspark.sql.functions import col
difference=df_DataFromtargetDiff.union(df_DataFromsrcDiff)
difference=diffdatadf.sort(col("businesskey for column"))
the final result will be 0 only, if 0 there is no difference both view data Comparision.
how should we achieve list of view and business key pass to single script to perform the data validation.