Apache Spark - How to refresh a table and do it concurrently?




I'm using Spark Streaming 2.1. I'd like to periodically refresh a cached table (loaded from a Spark-provided data source such as Parquet or MySQL, or from a user-defined data source).

  1. How do I refresh the table?

    Suppose I have a table loaded by

    spark.read.format("").load().createTempView("my_table")

    and cached by

    spark.sql("CACHE TABLE my_table")

    Is the following code enough to refresh the table, so that the next time the table is loaded it is automatically cached?

    spark.sql("REFRESH TABLE my_table")

    Or do I have to do it manually with

    spark.table("my_table").unpersist()
    spark.read.format("").load().createOrReplaceTempView("my_table")
    spark.sql("CACHE TABLE my_table")

  2. Is it safe to refresh the table concurrently?

    By concurrent I mean using a ScheduledThreadPoolExecutor to do the refresh work on a thread apart from the main thread.

    What will happen if Spark is using the cached table when I call refresh on it?
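The scheduling setup described in item 2 can be sketched as follows. This is a minimal Python analogue of running the refresh on a ScheduledThreadPoolExecutor; `refresh_my_table` is a hypothetical stand-in for the real Spark calls (unpersist, reload, re-cache), stubbed out here so the sketch runs without a cluster:

```python
import threading
import time

# Hypothetical stand-in for the real refresh: in the Spark job this would
# unpersist "my_table", reload it from the source, and cache it again.
def refresh_my_table(log):
    log.append("refreshed")

def schedule_refresh(interval_s, stop_event, log):
    """Run refresh_my_table every interval_s seconds on a background
    thread, mirroring fixed-rate scheduling on a ScheduledThreadPoolExecutor
    so the refresh work stays off the main thread."""
    def loop():
        # Event.wait returns False on timeout, True once stop_event is set
        while not stop_event.wait(interval_s):
            refresh_my_table(log)
    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker

log = []
stop = threading.Event()
worker = schedule_refresh(0.05, stop, log)
time.sleep(0.2)   # the main thread keeps doing its own work meanwhile
stop.set()
worker.join()
```

The stop event gives the main thread a clean way to shut the refresher down before exiting, which a ScheduledThreadPoolExecutor provides via `shutdown()`.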

Spark 2.2.0 introduced a feature for refreshing the metadata of a table when it has been updated by Hive or external tools.

You can achieve this using the API:

spark.catalog.refreshTable("my_table")

This API updates the table's metadata to keep it consistent.
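The manual unpersist/reload/cache sequence from the question can also be wrapped in a small helper. A minimal sketch: `FakeSpark` and `FakeDF` below are stub objects used only to make the call order visible without a cluster; in a real job `spark` would be the live SparkSession and `load` something like `lambda: spark.read.format("parquet").load(path)`:

```python
def refresh_cached_table(spark, name, load):
    """Drop the cached copy of `name`, reload it, and cache it again."""
    spark.table(name).unpersist()         # drop the stale cached data
    load().createOrReplaceTempView(name)  # re-register the view with fresh data
    spark.sql("CACHE TABLE " + name)      # cache the new contents

# Stubs recording the calls in order, standing in for a SparkSession.
class FakeDF:
    def __init__(self, calls):
        self.calls = calls
    def unpersist(self):
        self.calls.append("unpersist")
    def createOrReplaceTempView(self, name):
        self.calls.append("view:" + name)

class FakeSpark:
    def __init__(self):
        self.calls = []
    def table(self, name):
        return FakeDF(self.calls)
    def sql(self, query):
        self.calls.append(query)

spark = FakeSpark()
refresh_cached_table(spark, "my_table", lambda: FakeDF(spark.calls))
```

Keeping the sequence in one helper makes it easier to call from a scheduled background task instead of scattering the three steps across the job.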




