通过 Impala 和 Hive 访问集群版历史数据

最后更新于：2017-11-24 11:24:17

本文简述如何结合 API 通过 Impala 和 Hive 直接读取并解析神策分析集群版中的历史事件数据表。

需要说明，直接访问底层数据表有很强的风险，如果后端升级可能会导致相关访问方式失效，因此建议在选择这种访问方式之前先确认 JDBC访问是否可以满足需求，并优先使用这种方式导出数据。

1. 元数据

通过 Hive 和 Impala 读取集群版的历史数据前，需要获取一些必要的元数据，获取元数据需要使用神策分析的 API。

A. 项目/事件/属性元数据获取

请先了解 API 的基本使用。

熟悉 API 基本使用后，请用以下三个 API 获取项目/事件/属性的元数据

/api/projects
/api/events/all
/api/property/all

其中 projects 返回结果中 name 是 project 名, id 是要查询的 impala 表的表名后缀（如id 为 1 则表名为 event_ros_p1）。此 API 返回内容大致如下：

curl http://....../api/projects...

[
    {
        "cname": "\u6d4b\u8bd5\u9879\u76ee", # 项目中文名
        "create_time": "2016-11-08 16:26:46",
        "id": 1,                             # 是 对应 impala 表 表名的后缀
        "name": "default",                   # 项目英文名 / 标识符
        "status": "AVAILABLE"
    },
    {
        "cname": "\u6b63\u5f0f\u9879\u76ee",
        "create_time": "2016-11-08 16:27:00",
        "id": 2,
        "name": "production",
        "status": "AVAILABLE"
    },

也可以使用多项目管理工具来查看项目 ID 信息，例如:

project_tools -m list

------------------------------------------
project.id:  1 
name:        default
cname:       default
super_token: null
normal_token:    null
status:      AVAILABLE
create_time: 2016-11-26 18:39:59.0
------------------------------------------

其中 events/all 返回结果中 id 是后面要用到重要字段，举例如下：

curl http://....../api/events/all...

[
    {
        "cname": "$_sign_up",
        "id": 34,                     # impala 表中 event_id 字段，对应此处的 id 值，结合此处元数据可解析出对应何种事件.
        "name": "$_sign_up",
        "pinyin": "$_sign_up",
        "tag": [
            3
        ],
        "virtual": false,
        "visible": true
    },
    {
        "cname": "$AppEnd",
        "id": 38,
        "name": "$AppEnd",
        "pinyin": "$AppEnd",
        "tag": [
            3,
            13
        ],
        "virtual": false,
        "visible": true
...

其中 property/all 返回结果中 db_column_name 和 data_type 是解析属性数据的重要字段.

此处举例说明属性元数据中重要字段含义，例如：

curl http://....../api/property/all...

[
    {
        "cardinality": 0,
        "cname": "iOS\u6e20\u9053\u8ffd\u8e2a\u5339\u914d\u6a21\u5f0f",
        "data_type": "string",                     # 字段类型，取值有 bool, number, string, list, date, datetime，存储格式各异，详见后文描述
        "db_column_name": "p__utm_matching_type",  # 属性对应的 impala 表的列名，需要依据此字段读取 impala 表中的属性数据
        "event_id": -1,
        "event_name": "$Anything",
        "has_db_column_name": true,
        "has_dict": false,
        "id": 95,
        "is_dimension": true,
        "is_in_use": true,
        "is_measure": false,
        "name": "$utm_matching_type",
        "pinyin": "iOS qu dao zhui zong pi pei mo shi"
    },

...

2. 访问历史数据

A. 查询

数据表使用 impala 或 hive 查询。

请注意 impala/hive 对 parquet 数据 timestamp 类型（time字段使用）处理不同，详见后文 “time 字段时区问题” 一节。

为正常访问神策分析的 Parquet 文件，Impala 需要升级到 2.6.0 以上，通过打开 PARQUET_FALLBACK_SCHEMA=true 支持神策的 Parquet 文件，具体请参考： https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_parquet_fallback_schema_resolution.html

Hive 需要升级到 0.13.0 以上，保留默认设置 PARQUET_FALLBACK_SCHEMA=true，具体可以参考：https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.13.0

B. 库名 / 表名

库统一为 rawdata

event 历史数据表的表名为 event_ros_p{project_id} profile 数据表的表名为 profile_wos_p{project_id}

例如：

项目 default 的 id 为 1，对应的 event 历史数据表为 event_ros_p1，所属的库为 rawdata （以下简称 event 表）

C. time 字段时区问题

impala/hive 对 parquet 数据 timestamp 类型（time字段使用）处理不同

例如，神策系统设定时区为 Asia/Chongqing，同一条数据，用 Impala 查询，时间为 2016-12-12 08:33:57.930000000，而用 hive 查询则为 2016-12-12 16:33:57.93

time 字段存储的是 UTC 时间的数据，impala 按 UTC 时区读取，并换为 Asia/Chongqing 的本地时间表示后返回，而 hive 则直接按 Asia/Chongqing 读取事件数据，不做时区转换.

结论是：

从 impala 中直接读出的 time，可以认为是正确的。
想从 hive 中读出正确的 time 值，最简单的方法是把结果 -8 小时（假设本地时区是 Asia/Chongqing 东8区）.

Impala 文档中对这一点有详细解释，详见：http://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html#timestamp

D. 表结构

下面举例说明 event 表包含的字段和含义

+---------------------------+-----------+---------+
| name                      | type      | comment |
+---------------------------+-----------+---------+
| event_id                  | int       |         | 事件id，即 event_define.id
| month_id                  | int       |         | 月份 id，1970-01-01 所属 id 为 0，往后每个自然月 +1
| week_id                   | int       |         | 星期 id，1970-01-01 所属 id 为 0，往后每个自星期 +1
| user_id                   | bigint    |         | 神策系统内部使用的用户标识
| distinct_id               | string    |         | 此事件发生时，所使用的用户标识 
| time                      | timestamp |         | 事件发生的时间，注意 impala 和 hive 读出的不同
| day                       | int       |         | 日期 id, 1970-01-01 所属 id 为 0, 往后每个自然天 +1
| event_bucket              | int       |         | 事件分桶 id，过滤数据时，可作为过滤条件加快过滤
| p__app_version            | string    |         | 属性字段，均以 p_ 开头，下同。 property 元数据中 db_column_name 既对应此处的列名，属性字段存储格式详见后文
| p__city                   | string    |         |
| p__country                | string    |         |
| p__ip                     | string    |         |
| p__lib                    | string    |         |
| p__lib_version            | string    |         |
| p__manufacturer           | string    |         |
| p__model                  | string    |         |
| p__os                     | string    |         |
| p__os_version             | string    |         |
| p__province               | string    |         |
| p__screen_height          | bigint    |         |
| p__screen_width           | bigint    |         |
| p__wifi                   | bigint    |         |
| p__carrier                | string    |         |
| p__is_first_time          | bigint    |         |
| p__network_type           | string    |         |
| p__resume_from_background | bigint    |         |
| p__screen_name            | string    |         |
| p_event_duration          | bigint    |         |
| p_allowanceamount         | bigint    |         |
| p_channel                 | string    |         |
| p_keyword                 | string    |         |
| p_orderid                 | string    |         |
+---------------------------+-----------+---------+

eventid, month_id, week_id, user_id, distinct_id, time, day, event_bucket 这些字段所有 event 表均包含. p* 字段个数由各个项目的属性个数决定。

3. 属性值解析

NUMBER(1), STRING(2), LIST(3), DATE(4), DATETIME(5), BOOL(6)

属性字段命名为 p_*，与属性英文名相近，查询请以 property API 中 db_column_name 记录的为准.

不同类型的的存储格式和类型有所不同，具体如下：

Sensors Analytics 数据类型	中文名	data_type id	impala/hive 表字段类型	存储格式
NUMBER	数值型	1	bigint	原始值 * 1000 取整存储。读取时需要 /1000 得到原始值。最终范围：-9E15 到 9E15 小数点后最多保留3位
STRING	字符串	2	string	UTF-8 编码的字符串
LIST	字符串集合	3	string	UTF-8 编码的，"\n" 分割的多个字符串
DATE	日期	4	bigint	按天取整的毫秒时间戳，例如日期为 2015/6/1 则，取当天北京时间 0:0:0，即 2015/6/1 0:0:0 计算对应的时间戳为 1433088000000。读取时，解析时间戳后取日期部分即可
DATETIME	日期时间	5	bigint	存储毫秒时间戳
BOOL	布尔值	6	bigint	真为 1 假为 0