Configuring and Starting the Spark Thrift Server Gateway

Published: 2023-05-14 13:17:46

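Spark Thrift Server is a Spark SQL server for external JDBC/ODBC applications. It exposes a standard SQL interface so that users can interact with a Spark cluster through SQL statements. Because it is compatible with HiveServer2, JDBC/ODBC clients that speak the HiveServer2 protocol, such as Beeline, Hue, and Tableau, can connect to it. This article describes how to configure and start the Spark Thrift Server gateway.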

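1. Environment preparation

Before starting, Spark and Hadoop must be installed on the cluster, and all Hadoop services must be running: NameNode, DataNode, ResourceManager, and NodeManager. A JDBC/ODBC client is also needed to connect to the server later.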
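One way to confirm that the four Hadoop daemons are up is to inspect the output of jps on each node. The following is a minimal sketch and not part of the original setup: the function name missing_daemons is hypothetical, and it assumes the daemons show up in jps output under their usual class names.

```shell
# Daemons named in the prerequisites above.
REQUIRED_DAEMONS="NameNode DataNode ResourceManager NodeManager"

# missing_daemons JPS_OUTPUT
# Prints the required daemons that do not appear in the given jps output.
missing_daemons() (
  jps_output="$1"
  missing=""
  for daemon in $REQUIRED_DAEMONS; do
    # -w matches whole words only, so SecondaryNameNode does not count as NameNode
    if ! printf '%s\n' "$jps_output" | grep -qw "$daemon"; then
      missing="$missing $daemon"
    fi
  done
  # Strip the leading space before printing
  printf '%s\n' "${missing# }"
)
```

Calling missing_daemons "$(jps)" prints an empty line when everything is present on that node. On a multi-node cluster the daemons are spread across machines, so a worker node is expected to report NameNode and ResourceManager as missing.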

2. Configuring Spark Thrift Server

On the master node of the Spark cluster, a few Spark Thrift Server settings need to be put in place:

2.1 First, go to Spark's conf directory and create a folder named thriftserver:

cd $SPARK_HOME/conf
mkdir thriftserver

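2.2 Next, create a file named hive-site.xml in the thriftserver folder. This file connects Spark Thrift Server to the Hive Metastore; if you do not use Hive, you can skip this step. Note that Spark only loads hive-site.xml from its configuration directory ($SPARK_HOME/conf by default), so the finished file must also be copied or linked into that directory to take effect. The commands are: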

cd $SPARK_HOME/conf/thriftserver
vim hive-site.xml

In vim, enter the following content:

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://<hive_db_hostname>:<hive_db_port>/<hive_db_name>?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore </description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value><hive_db_user_name></value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value><hive_db_user_password></value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>false</value>
        <description>Set to true if you want to automatically create tables in the metastore database</description>
    </property>
    <property>
        <name>datanucleus.fixedDatastore</name>
        <value>true</value>
        <description>Set to true if you're not planning to add or remove columns to the Hive schema</description>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://<hive_metastore_hostname>:<hive_metastore_port></value>
        <description>URI for the remote metastore. Used by metastore client to connect to remote metastore</description>
    </property>
</configuration>

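This file holds the information needed to reach the Hive Metastore. Replace each placeholder in angle brackets with the value for your environment. When hive.metastore.uris is set, Spark connects to that remote metastore service over Thrift, and the javax.jdo.option.* database settings are used by the metastore service itself rather than by Spark.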
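Because the template above is full of placeholders, it is easy to leave one unreplaced. The sketch below checks for leftovers and then links the file into Spark's configuration directory, which is where Spark looks for hive-site.xml. Both function names are hypothetical, and the check assumes the placeholders keep the <hive_...> naming used in the template.

```shell
# find_placeholders FILE
# Prints (with line numbers) every line that still contains a template
# placeholder such as <hive_db_hostname>. Exit status is 0 if any remain.
find_placeholders() {
  grep -nE '<hive_[a-z_]+>' "$1"
}

# install_hive_site FILE CONF_DIR
# Refuses to continue while placeholders remain, then symlinks FILE into
# CONF_DIR as hive-site.xml. FILE should be an absolute path so the link
# resolves correctly.
install_hive_site() (
  file="$1"
  conf_dir="$2"
  if find_placeholders "$file" >&2; then
    echo "unreplaced placeholders in $file" >&2
    exit 1
  fi
  ln -sf "$file" "$conf_dir/hive-site.xml"
)
```

A typical call would be install_hive_site "$SPARK_HOME/conf/thriftserver/hive-site.xml" "$SPARK_HOME/conf". A symlink is used here so that later edits in the thriftserver folder stay in effect; copying the file works equally well.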

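2.3 Next, open spark-defaults.conf in $SPARK_HOME/conf. Spark reads its default properties from this file; it does not ship a separate thrift-defaults.conf. Make sure the following settings are present: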

spark.sql.hive.thriftServer.singleSession = true 
spark.sql.hive.thriftServer.async = true 
spark.sql.hive.thriftServer.completionIterators = true

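With spark.sql.hive.thriftServer.singleSession set to true, all JDBC/ODBC connections share a single session, including its SQL configuration and temporary function registry. spark.sql.hive.thriftServer.async lets the server execute submitted statements asynchronously in a background thread pool, which improves responsiveness. spark.sql.hive.thriftServer.completionIterators is not among the commonly documented Thrift Server options, so verify it against the configuration reference for your Spark version.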
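Editing the properties by hand is fine for a single node, but a small helper keeps the file free of duplicate keys when the step is repeated. This is a minimal sketch, assuming the properties live in $SPARK_HOME/conf/spark-defaults.conf, where Spark reads its defaults; set_spark_conf is a hypothetical helper, not a Spark command.

```shell
# set_spark_conf KEY VALUE FILE
# Ensures FILE contains exactly one line setting KEY to VALUE:
# any existing line for KEY is dropped, then "KEY VALUE" is appended.
set_spark_conf() (
  key="$1"
  value="$2"
  file="$3"
  # Escape dots so the key is matched literally in the regular expression
  pattern="$(printf '%s' "$key" | sed 's/\./\\./g')"
  tmp="$(mktemp)"
  if [ -f "$file" ]; then
    # Keep every line that does not assign this key (whitespace or = after it)
    grep -vE "^[[:space:]]*${pattern}([[:space:]=]|\$)" "$file" > "$tmp"
  fi
  printf '%s %s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the original file's permissions are preserved
  cat "$tmp" > "$file"
  rm -f "$tmp"
)
```

For example, set_spark_conf spark.sql.hive.thriftServer.singleSession true "$SPARK_HOME/conf/spark-defaults.conf" replaces an earlier value instead of adding a second line. The same properties can also be passed at launch time with --conf key=value, which avoids touching the file at all.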

3. Starting Spark Thrift Server

Once the configuration is complete, start Spark Thrift Server with the following command:

$SPARK_HOME/sbin/start-thriftserver.sh \
    --master spark://<master_hostname>:<master_port> \
    --hiveconf hive.server2.thrift.bind.host=<thrift_server_hostname> \
    --hiveconf hive.server2.transport.mode=binary \
    --hiveconf hive.server2.thrift.http.port=<http_port> \
    --hiveconf hive.server2.thrift.port=<thrift_port>

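Be sure to replace <master_hostname>, <master_port>, <thrift_server_hostname>, <http_port>, and <thrift_port> with your own values. Since the transport mode is binary, clients connect on hive.server2.thrift.port; hive.server2.thrift.http.port only takes effect when hive.server2.transport.mode is set to http. Once the server is running, any SQL tool that supports JDBC/ODBC should be able to connect to it and run queries.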
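A quick way to verify the server is to connect with the beeline client that ships in $SPARK_HOME/bin. The sketch below builds the JDBC URL for binary transport mode and runs one statement; the function names are hypothetical, and it assumes an unsecured cluster where no user name or password is required.

```shell
# thrift_jdbc_url HOST [PORT]
# Builds the HiveServer2-style JDBC URL for binary transport mode.
# PORT defaults to 10000, the default of hive.server2.thrift.port.
thrift_jdbc_url() {
  printf 'jdbc:hive2://%s:%s\n' "$1" "${2:-10000}"
}

# smoke_test HOST [PORT]
# Runs one trivial statement through the beeline client shipped with Spark.
smoke_test() {
  "$SPARK_HOME/bin/beeline" -u "$(thrift_jdbc_url "$1" "$2")" -e 'SHOW DATABASES;'
}
```

Running smoke_test with the host and port passed to start-thriftserver.sh should list the databases visible through the metastore. When the server is no longer needed, it can be stopped with $SPARK_HOME/sbin/stop-thriftserver.sh.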