Spark SQL security considerations
What are the security considerations when accepting and executing arbitrary Spark SQL queries?
Imagine the following setup:
Two files on HDFS are registered as tables a_secrets and b_secrets:
# must only be accessed by clients with access to all of customer a's data
spark.read.csv("/customer_a/secrets.csv").createTempView("a_secrets")
# must only be accessed by clients with access to all of customer b's data
spark.read.csv("/customer_b/secrets.csv").createTempView("b_secrets")
These two views I could secure using simple HDFS file permissions. But say I have the following logical views of these tables that I'd like to expose:
# only access for clients with access to customer a's account no 1
spark.sql("SELECT * FROM a_secrets WHERE account = 1").createTempView("a1_secrets")
# only access for clients with access to customer a's account no 2
spark.sql("SELECT * FROM a_secrets WHERE account = 2").createTempView("a2_secrets")
# only access for clients with access to customer b's account no 1
spark.sql("SELECT * FROM b_secrets WHERE account = 1").createTempView("b1_secrets")
# only access for clients with access to customer b's account no 2
spark.sql("SELECT * FROM b_secrets WHERE account = 2").createTempView("b2_secrets")
Now assume I receive an arbitrary (user, pass, query) tuple. I get a list of accounts the user can access:
groups = get_groups(user, pass)
and extract the logical query plan of the user's query:
spark.sql(query).explain(True)
giving me a query plan along the lines of the following (this exact query plan is made up):
== Analyzed Logical Plan ==
account: int, ... more fields
Project [account#0 ... more fields]
+- SubqueryAlias a1_secrets
+- Relation [... more fields]
+- Join Inner, (some_col#0 = another_col#67)
:- SubqueryAlias a2_secrets
: +- Relation[... more fields] csv
== Physical Plan ==
... InputPaths: hdfs:/customer_a/secrets.csv ...
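Given plan text in the format above, a crude sketch of pulling out the referenced views by matching SubqueryAlias lines could look like this (a robust check should traverse the Catalyst logical plan rather than parse strings, and the plan format here follows the made-up example above):

```python
def views_in_plan(plan_text):
    """Extract view/table names from analyzed logical plan text.

    String matching against Spark 2.x-style plan output -- brittle by
    design; a production check should walk the logical plan tree instead.
    """
    names = set()
    for line in plan_text.splitlines():
        if "SubqueryAlias" in line:
            # e.g. "+- SubqueryAlias a1_secrets"
            tail = line.split("SubqueryAlias", 1)[1].strip()
            names.add(tail.split()[0].rstrip(","))
    return names
```

For the made-up plan above this would yield {"a1_secrets", "a2_secrets"}.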
Assuming I can parse a logical query plan to determine exactly which tables and files are being accessed, is it safe to grant access to the data produced by the query? I'm thinking of potential problems like:
- Are there ways to access registered tables without them showing up in a logical query plan?
- Are there ways to load new data and register it as tables through pure Spark SQL (input to spark.sql(...))?
- Do users have access to any SQL functions with side effects (that modify or access unauthorized data)?
- Are there ways to register UDFs/execute arbitrary code purely through spark.sql(...)?
To summarise: can I safely accept arbitrary SQL, register it with df = spark.sql(query), analyse data access using df.explain(True), and then return results using e.g. df.collect()?
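The gate I have in mind would then be a simple subset check. A hypothetical sketch, assuming the group names returned by get_groups map one-to-one onto the view names above (both the mapping and the names are illustrative):

```python
def is_authorized(groups, referenced_views):
    """Allow the query only if every view it references is owned by one
    of the user's groups. Assumes group "a1" owns view "a1_secrets"
    (an illustrative naming convention, not a real API).
    """
    allowed = {g + "_secrets" for g in groups}
    return referenced_views <= allowed
```

A query joining a1_secrets with b1_secrets would then be rejected for a user whose groups are {"a1", "a2"}.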
Edits:
- 23 Jan 15:29: edited to include an "EXPLAIN" prefix in
apache-spark pyspark apache-spark-sql pyspark-sql
This blog post might help you.
– swenzel
Jan 23 '17 at 12:16
Hortonworks' Ranger (and Cloudera's RecordService) would provide a much more fully featured security layer between Spark and my data, and practically it would probably be a good idea to go straight down that route. But for now, I'm simply looking for a better understanding of the security implications of accepting Spark SQL strings straight from a user.
– jkgeyti
Jan 23 '17 at 13:11
edited Jan 23 '17 at 15:29
asked Jan 23 '17 at 11:59
jkgeyti
1 Answer
TL;DR You should never execute any untrusted code on your Spark cluster.
Are there ways to load new data and register it as tables through pure Spark SQL?
Yes. CREATE TABLE can be executed using the sql method, so as long as users have permission to access the filesystem, they can create tables.
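To make that concrete: a pure-SQL payload along these lines (path and view name illustrative) would expose unauthorized data, provided the account running Spark can read the file, without any spark.read call ever appearing:

```sql
-- Spark SQL only; no DataFrame API involved
CREATE TEMPORARY VIEW stolen
USING csv
OPTIONS (path '/customer_b/secrets.csv')
```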
Are there ways to register UDFs/execute arbitrary code purely through spark.sql(...)?
Yes, as long as they can control the classpath, which can be modified with SQL:
spark.sql("""add jar URI""")
Do users have access to any SQL functions with side effects (that modify or access unauthorized data)?
Effectively yes (by extension of the previous point).
Can I safely accept arbitrary SQL,
No.
Can you expand on how a user can control the classpath using a Spark SQL query? A user does not call spark.sql himself, but merely provides the query as a string, which is passed to spark.sql(...) in a controlled application layer.
– jkgeyti
Jan 23 '17 at 13:08
spark.sql("""add jar URI""") is a valid query string. It even returns a table.
– user7337271
Jan 23 '17 at 13:13
@jkgeyti There is a SQLAlchemy Hive dialect which could be helpful for you.
– user7337271
Jan 23 '17 at 16:19
edited Jan 23 '17 at 16:20
answered Jan 23 '17 at 12:14
user7337271