この記事は約100分16秒で読むことができます。

sql oracle MATCH_RECOGNIZE lesson

ことはじめ

MATCH_RECOGNIZE句の動きを始める前より理解を深めたい。

参考文献

DEEP DIVE INTO 12c MATCH_RECOGNIZE  
Overview of Pattern Matching in Data Warehouses  
Pattern Matching (MATCH_RECOGNIZE) in Oracle Database 12c Release 1 (12.1)  
MATCH_RECOGNIZE - SKIP TO where exactly?  
MATCH_RECOGNIZE - Log file sessionization analysis  
Introduction to MATCH_RECOGNIZE

テストデータの準備

oracle_baseさんのデータを拝借。分かりやすいです。とても。

sales_historyを作成。

コード表示

DROP TABLE sales_history PURGE;

CREATE TABLE sales_history (
  id            NUMBER,
  product       VARCHAR2(20),
  tstamp        TIMESTAMP,
  units_sold    NUMBER,
  CONSTRAINT sales_history_pk PRIMARY KEY (id)
);

ALTER SESSION SET nls_timestamp_format = 'DD-MON-YYYY';

INSERT INTO sales_history VALUES ( 1, 'TWINKIES', '01-OCT-2014', 17);
INSERT INTO sales_history VALUES ( 2, 'TWINKIES', '02-OCT-2014', 19);
INSERT INTO sales_history VALUES ( 3, 'TWINKIES', '03-OCT-2014', 23);
INSERT INTO sales_history VALUES ( 4, 'TWINKIES', '04-OCT-2014', 23);
INSERT INTO sales_history VALUES ( 5, 'TWINKIES', '05-OCT-2014', 16);
INSERT INTO sales_history VALUES ( 6, 'TWINKIES', '06-OCT-2014', 10);
INSERT INTO sales_history VALUES ( 7, 'TWINKIES', '07-OCT-2014', 14);
INSERT INTO sales_history VALUES ( 8, 'TWINKIES', '08-OCT-2014', 16);
INSERT INTO sales_history VALUES ( 9, 'TWINKIES', '09-OCT-2014', 15);
INSERT INTO sales_history VALUES (10, 'TWINKIES', '10-OCT-2014', 17);
INSERT INTO sales_history VALUES (11, 'TWINKIES', '11-OCT-2014', 23);
INSERT INTO sales_history VALUES (12, 'TWINKIES', '12-OCT-2014', 30);
INSERT INTO sales_history VALUES (13, 'TWINKIES', '13-OCT-2014', 31);
INSERT INTO sales_history VALUES (14, 'TWINKIES', '14-OCT-2014', 29);
INSERT INTO sales_history VALUES (15, 'TWINKIES', '15-OCT-2014', 25);
INSERT INTO sales_history VALUES (16, 'TWINKIES', '16-OCT-2014', 21);
INSERT INTO sales_history VALUES (17, 'TWINKIES', '17-OCT-2014', 35);
INSERT INTO sales_history VALUES (18, 'TWINKIES', '18-OCT-2014', 46);
INSERT INTO sales_history VALUES (19, 'TWINKIES', '19-OCT-2014', 45);
INSERT INTO sales_history VALUES (20, 'TWINKIES', '20-OCT-2014', 30);
COMMIT;

sales_historyの確認

コード表示

ALTER SESSION SET nls_timestamp_format = 'DD-MON-YYYY';

SET PAGESIZE 50
COLUMN product    FORMAT A10
COLUMN tstamp     FORMAT A11
COLUMN graph      FORMAT A50

SELECT id,
       product,
       tstamp,
       units_sold,
       RPAD('#', units_sold, '#') AS graph
FROM   sales_history
ORDER BY id;
コード表示

           ID PRODUCT    TSTAMP         UNITS_SOLD GRAPH
------------- ---------- ----------- ------------- --------------------------------------------------
            1 TWINKIES   01-OCT-2014            17 #################
            2 TWINKIES   02-OCT-2014            19 ###################
            3 TWINKIES   03-OCT-2014            23 #######################
            4 TWINKIES   04-OCT-2014            23 #######################
            5 TWINKIES   05-OCT-2014            16 ################
            6 TWINKIES   06-OCT-2014            10 ##########
            7 TWINKIES   07-OCT-2014            14 ##############
            8 TWINKIES   08-OCT-2014            16 ################
            9 TWINKIES   09-OCT-2014            15 ###############
           10 TWINKIES   10-OCT-2014            17 #################
           11 TWINKIES   11-OCT-2014            23 #######################
           12 TWINKIES   12-OCT-2014            30 ##############################
           13 TWINKIES   13-OCT-2014            31 ###############################
           14 TWINKIES   14-OCT-2014            29 #############################
           15 TWINKIES   15-OCT-2014            25 #########################
           16 TWINKIES   16-OCT-2014            21 #####################
           17 TWINKIES   17-OCT-2014            35 ###################################
           18 TWINKIES   18-OCT-2014            46 ##############################################
           19 TWINKIES   19-OCT-2014            45 #############################################
           20 TWINKIES   20-OCT-2014            30 ##############################

20 rows selected.

Elapsed: 00:00:00.02

拝借したsqlをそのまま実行

コード表示

SET LINESIZE 110
ALTER SESSION SET nls_timestamp_format = 'DD-MON-YYYY';

COLUMN start_tstamp FORMAT A11
COLUMN peak_tstamp  FORMAT A11
COLUMN end_tstamp   FORMAT A11
COLUMN cls          FORMAT A5
COLUMN units_sold FORMAT 9999
set linesize 200

SELECT *
FROM   sales_history MATCH_RECOGNIZE (
         PARTITION BY product
         ORDER BY tstamp
         MEASURES  STRT.tstamp AS start_tstamp,
                   FINAL LAST(UP.tstamp) AS peak_tstamp,
                   FINAL LAST(DOWN.tstamp) AS end_tstamp,
                   MATCH_NUMBER() AS mno,
                   CLASSIFIER() AS cls
         ALL ROWS PER MATCH
         AFTER MATCH SKIP TO LAST DOWN
         PATTERN (STRT UP+ FLAT* DOWN+)
         DEFINE
           UP AS UP.units_sold > PREV(UP.units_sold),
           DOWN AS DOWN.units_sold < PREV(DOWN.units_sold),
           FLAT AS FLAT.units_sold = PREV(FLAT.units_sold)
       ) MR
ORDER BY MR.product, MR.mno, MR.tstamp;

FINAL LAST(pattern.col_name)は定義したパタンが最後に現れたcol_name値を返却する。パタンマッチした範囲(フレーム)の中で一番大きい値かな。

STRTはパタン変数の定義のみにとどまっているが、これはパタンマッチの開始点として定義している。開始点はのこりのパタン定義より判断できるからだとおもう。

MATCH_NUMBER()に関してはパタンマッチした範囲における通番を返却。どの範囲ないし期間かを判別できる番号を返却。

pattern句に定義した順にパタンマッチをしていくので、定義する順番を変えてしまうと、異なる結果が得られる。パタンを単純に定義するだけでなく、パタンマッチしていく順番も考慮に入れる必要がある。

CLASSIFIER()に関してはパタンマッチした範囲において当該レコードがどのパタンに属しているのかを判別できる値を返却してくれる。

pattern句に定義した最後のパタンを見つけるまで、走査し、今回の例だとDOWN。発見した次のパタンはSTRT。そこでまたDOWNパタンが見つかるまで走査。多分その繰り返し。パタンを定義する際には最初と最後のパタンをまず決めて、そのパタン走査の中で、起こりうるパタンを定義していくイメージが書きやすそう。

AFTER MATCH SKIP TO LAST パタン変数に関してはパタンマッチした範囲において最後に現れたパタン変数はそのパタンとしては認識せず、次のパタンとして認識するように明示的指定できるもの。今回のAFTER MATCH SKIP TO LAST DOWNに関していえばパタンマッチした範囲の中で、最後に現れたDOWNパタンはスキップし、STRTパタンとしてパタンマッチするように促すことができる。2つのパタン定義を満たす1点が存在するときに、そのどちらのパタンを優先させてやるかを指定できる。id列にその観点が表現されている。

PER MATCHで出力する行をサマリ行か明細行も含めたすべてかを指定することができる。以下の資料が参考になる。

AFTER MATCH SKIP: Defining Where to Restart the Matching Process After a Match Is Found  
Tasks and Keywords in Pattern Matching PER MATCH: Choosing Summaries or Details for Each Match

パーティション違いでも効くか試してみる

コード表示

insert into sales_history select max(id) over () + rownum as ID, 'kiwiiiii' as PRODUCT, TSTAMP, UNITS_SOLD from sales_history;
commit;

SET LINESIZE 110
ALTER SESSION SET nls_timestamp_format = 'DD-MON-YYYY';

COLUMN start_tstamp FORMAT A11
COLUMN peak_tstamp  FORMAT A11
COLUMN end_tstamp   FORMAT A11
COLUMN cls          FORMAT A5
COLUMN units_sold FORMAT 9999
set linesize 200

SELECT *
FROM   sales_history MATCH_RECOGNIZE (
         PARTITION BY product
         ORDER BY tstamp
         MEASURES  STRT.tstamp AS start_tstamp,
                   FINAL LAST(UP.tstamp) AS peak_tstamp,
                   FINAL LAST(DOWN.tstamp) AS end_tstamp,
                   MATCH_NUMBER() AS mno,
                   CLASSIFIER() AS cls
         ALL ROWS PER MATCH
         AFTER MATCH SKIP TO LAST DOWN
         PATTERN (STRT UP+ FLAT* DOWN+)
         DEFINE
           UP AS UP.units_sold > PREV(UP.units_sold),
           DOWN AS DOWN.units_sold < PREV(DOWN.units_sold),
           FLAT AS FLAT.units_sold = PREV(FLAT.units_sold)
       ) MR
ORDER BY MR.product, MR.mno, MR.tstamp;

パーティションごとにパタンマッチされている

パタン句の順番を変えてみる

STRT→DOWN→FLAT→UP→STRTの順に。

コード表示

SET LINESIZE 110
ALTER SESSION SET nls_timestamp_format = 'DD-MON-YYYY';

COLUMN start_tstamp FORMAT A11
COLUMN peak_tstamp  FORMAT A11
COLUMN end_tstamp   FORMAT A11
COLUMN cls          FORMAT A5
COLUMN units_sold FORMAT 9999
set linesize 200

SELECT *
FROM   sales_history MATCH_RECOGNIZE (
         PARTITION BY product
         ORDER BY tstamp
         MEASURES  STRT.tstamp AS start_tstamp,
                   FINAL LAST(UP.tstamp) AS peak_tstamp,
                   FINAL LAST(DOWN.tstamp) AS end_tstamp,
                   MATCH_NUMBER() AS mno,
                   CLASSIFIER() AS cls
         ALL ROWS PER MATCH
         AFTER MATCH SKIP TO LAST UP
         PATTERN (STRT DOWN+ FLAT* UP+)
         DEFINE
           UP AS UP.units_sold > PREV(UP.units_sold),
           DOWN AS DOWN.units_sold < PREV(DOWN.units_sold),
           FLAT AS FLAT.units_sold = PREV(FLAT.units_sold)
       ) MR
ORDER BY MR.product, MR.mno, MR.tstamp;

FLATのパタンマッチが見えなくなった。日付をまたいで同じUNITS_SOLDがなくなったから。

LAST()をFIRST()に変えてみる

コード表示

SET LINESIZE 110
ALTER SESSION SET nls_timestamp_format = 'DD-MON-YYYY';

COLUMN start_tstamp FORMAT A11
COLUMN peak_tstamp  FORMAT A11
COLUMN end_tstamp   FORMAT A11
COLUMN cls          FORMAT A5
COLUMN units_sold FORMAT 9999
set linesize 200

SELECT *
FROM   sales_history MATCH_RECOGNIZE (
         PARTITION BY product
         ORDER BY tstamp
         MEASURES  STRT.tstamp AS start_tstamp,
                   FINAL FIRST(UP.tstamp) AS peak_tstamp,
                   FINAL FIRST(DOWN.tstamp) AS end_tstamp,
                   MATCH_NUMBER() AS mno,
                   CLASSIFIER() AS cls
         ALL ROWS PER MATCH
         AFTER MATCH SKIP TO LAST UP
         PATTERN (STRT DOWN+ FLAT* UP+)
         DEFINE
           UP AS UP.units_sold > PREV(UP.units_sold),
           DOWN AS DOWN.units_sold < PREV(DOWN.units_sold),
           FLAT AS FLAT.units_sold = PREV(FLAT.units_sold)
       ) MR
ORDER BY MR.product, MR.mno, MR.tstamp;

パタンマッチした範囲でUPパタン、DOWNパタンが最初に現れた日付を返却している。

FINALをとってみる

コード表示

SET LINESIZE 110
ALTER SESSION SET nls_timestamp_format = 'DD-MON-YYYY';

COLUMN start_tstamp FORMAT A11
COLUMN peak_tstamp  FORMAT A11
COLUMN end_tstamp   FORMAT A11
COLUMN cls          FORMAT A5
COLUMN units_sold FORMAT 9999
set linesize 200

SELECT *
FROM   sales_history MATCH_RECOGNIZE (
         PARTITION BY product
         ORDER BY tstamp
         MEASURES  STRT.tstamp AS start_tstamp,
                   FIRST(UP.tstamp) AS peak_tstamp,
                   FINAL FIRST(DOWN.tstamp) AS end_tstamp,
                   MATCH_NUMBER() AS mno,
                   CLASSIFIER() AS cls
         ALL ROWS PER MATCH
         AFTER MATCH SKIP TO LAST UP
         PATTERN (STRT DOWN+ FLAT* UP+)
         DEFINE
           UP AS UP.units_sold > PREV(UP.units_sold),
           DOWN AS DOWN.units_sold < PREV(DOWN.units_sold),
           FLAT AS FLAT.units_sold = PREV(FLAT.units_sold)
       ) MR
ORDER BY MR.product, MR.mno, MR.tstamp;

FINALがついていない列にnullが現れた。パタンマッチした範囲を最初に枠決めすると思われる。その範囲においてカレント行がUPパタンなのか、DOWNパタンなのかをしていくと思うので、まだ検知されていないから、返却する値が分からず、NULLが現れた。

measures句に集計関数をいれてみる

measures句は対象テーブルのカラムに加えて、確認したい列を定義したパタンを使って独自に指定できる。pattern句には正規表現が使える。

コード表示

SET LINESIZE 110
ALTER SESSION SET nls_timestamp_format = 'DD-MON-YYYY';

COLUMN start_tstamp FORMAT A11
COLUMN peak_tstamp  FORMAT A11
COLUMN end_tstamp   FORMAT A11
COLUMN cls          FORMAT A5
COLUMN units_sold FORMAT 9999
set linesize 200

SELECT *
FROM   sales_history MATCH_RECOGNIZE (
         PARTITION BY product
         ORDER BY tstamp
         MEASURES  STRT.tstamp AS start_tstamp,
                   FINAL LAST(UP.tstamp) AS peak_tstamp,
                   FINAL LAST(DOWN.tstamp) AS end_tstamp,
                   FINAL count(UP.tstamp) AS up_cnt,
                   FINAL count(FLAT.tstamp) AS flat_cnt,
                   FINAL count(DOWN.tstamp) AS down_cnt,
                   RUNNING count(tstamp) AS runnig_cnt,
                   FINAL count(tstamp) AS summury_cnt,
                   units_sold - STRT.units_sold AS diff,
                   MATCH_NUMBER() AS mno,
                   CLASSIFIER() AS cls
         ALL ROWS PER MATCH
         AFTER MATCH SKIP TO LAST DOWN
         PATTERN (STRT UP+ FLAT* DOWN+ )
         DEFINE
           UP AS UP.units_sold > PREV(UP.units_sold),
           DOWN AS DOWN.units_sold < PREV(DOWN.units_sold),
           FLAT AS FLAT.units_sold = PREV(FLAT.units_sold)
       ) MR
ORDER BY MR.product, MR.mno, MR.tstamp;

runningキーワードはパタンマッチした範囲で通番を振ってくれる。集計関数はパタンマッチした範囲の中で、適用される。

ここで、ちょっとグラフ化してみる

対象データはこれにした。

コード表示

insert into sales_history select max(id) over () + rownum as ID, 'kiwiiiii' as PRODUCT, TSTAMP, UNITS_SOLD - 10 as UNITS_SOLD from sales_history;
commit;

エクセルに食わせるデータはこのsqlで生成

コード表示

ALTER SESSION SET nls_timestamp_format = 'YYYYMMDD';

WITH tbl_head AS (
SELECT
	tstamp
	,MIN(twinkies) AS twinkies
	,MIN(kiwiiiii) AS kiwiiiii
FROM
	sales_history
	PIVOT(MIN(units_sold) FOR product IN ('TWINKIES' AS twinkies,'kiwiiiii' AS kiwiiiii))
GROUP BY
	tstamp
ORDER BY
	tstamp
),tbl_side AS (
SELECT
	to_char(dtt,'yyyymmdd') AS dt
FROM
	dual
MODEL
	DIMENSION BY ( 1 AS rn )
	MEASURES ( TO_DATE('20141031') AS dtt )
	RULES ITERATE(31)
	( dtt[iteration_number] = TO_DATE('20141031') - iteration_number )
)
SELECT
	nvl(tstamp,dt) AS tstamp
	,twinkies
	, kiwiiiii
FROM
	tbl_side s1
		LEFT OUTER JOIN tbl_head s2
	ON
		s1.dt = s2.tstamp
ORDER BY
	s1.dt
;

パタン句にパタンを追加してみる

STRT→DOWN+→FLAT*→UP+→DOWN+→FLAT*→UP+。W型パタンあるか。FLATはあれば検知する。

コード表示

SET LINESIZE 110
ALTER SESSION SET nls_timestamp_format = 'DD-MON-YYYY';

COLUMN start_tstamp FORMAT A11
COLUMN peak_tstamp  FORMAT A11
COLUMN end_tstamp   FORMAT A11
COLUMN cls          FORMAT A5
COLUMN units_sold FORMAT 9999
set linesize 200

SELECT *
FROM   sales_history MATCH_RECOGNIZE (
         PARTITION BY product
         ORDER BY tstamp
         MEASURES  STRT.tstamp AS start_tstamp,
                   LAST(DOWN.tstamp) AS cur_bottom_tstamp,
                   FINAL LAST(DOWN.tstamp) AS bottom_tstamp,
                   LAST(UP.tstamp) AS cur_peak_tstamp,
                   FINAL LAST(UP.tstamp) AS peak_tstamp,
                   MATCH_NUMBER() AS mno,
                   CLASSIFIER() AS cls
         ALL ROWS PER MATCH
         AFTER MATCH SKIP TO LAST UP
         PATTERN (STRT DOWN+ FLAT* UP+ DOWN+ FLAT* UP+)
         DEFINE
           UP AS UP.units_sold > PREV(UP.units_sold),
           DOWN AS DOWN.units_sold < PREV(DOWN.units_sold),
           FLAT AS FLAT.units_sold = PREV(FLAT.units_sold)
       ) MR
ORDER BY MR.product, MR.mno, MR.tstamp;

セッションナイズはじめてみる

セッショナイズの定義はいろいろあるみたいだけど、今回は同一のユーザーの連続するアクセスで、アクセス間隔が3秒未満である一連のアクセスを一つのセッションとしてみる。

コード表示

COL NAME FOR a40;
COL VALUE FOR a100;

SELECT NAME, VALUE FROM v$parameter WHERE NAME LIKE '%' || 'timestamp' ||'%';

ALTER SESSION SET nls_timestamp_tz_format = 'YYYY-MM-DD HH24:MI:SS';

DROP TABLE test______ PURGE;

CREATE TABLE test______ AS 
WITH sub AS (
SELECT
	*
FROM
	dual
MODEL
	DIMENSION BY ( 1 AS rn )
	MEASURES ( current_timestamp AS tp )
	RULES ITERATE(30)
	( tp[iteration_number] = current_timestamp - iteration_number/86400 )
ORDER BY
	dbms_random.random()
)
SELECT
	rpad('user0',6,decode(MOD(ROWNUM,6),0,6,MOD(ROWNUM,6))) AS user_id
	,rpad('item',6,CHR(64 + decode(MOD(ROWNUM,4),0,4,MOD(ROWNUM,4)))) AS item
	,TRUNC(ABS(dbms_random.VALUE(1,10)),0) AS qty
	,tp
FROM
	sub
ORDER BY
	1,4
;

SELECT * FROM test______ ORDER BY user_id,tp;

コード表示

[oracle@f285aba0589a ~]$ sqlplus aine/ORACLE_PWD@pdb1

SQL*Plus: Release 18.0.0.0.0 - Production on Fri Feb 22 23:38:47 2019
Version 18.3.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.

Last Successful login time: Fri Feb 22 2019 23:35:40 +09:00

Connected to:
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
Version 18.3.0.0.0

AINE@pdb1> col name for a40;
AINE@pdb1> col value for a100;
AINE@pdb1>
AINE@pdb1> SELECT NAME, VALUE FROM v$parameter WHERE NAME LIKE '%' || 'timestamp' ||'%';

NAME                                     VALUE
---------------------------------------- ----------------------------------------------------------------------------------------------------
uniform_log_timestamp_format             TRUE
nls_timestamp_format
nls_timestamp_tz_format

3 rows selected.

Elapsed: 00:00:00.01
AINE@pdb1> ALTER SESSION SET nls_timestamp_tz_format = 'YYYY-MM-DD HH24:MI:SS';

Session altered.

Elapsed: 00:00:00.00
AINE@pdb1> SELECT NAME, VALUE FROM v$parameter WHERE NAME LIKE '%' || 'timestamp' ||'%';

NAME                                     VALUE
---------------------------------------- ----------------------------------------------------------------------------------------------------
uniform_log_timestamp_format             TRUE
nls_timestamp_format
nls_timestamp_tz_format                  YYYY-MM-DD HH24:MI:SS

3 rows selected.

Elapsed: 00:00:00.01
AINE@pdb1>

サマリ行だけ出力するようにしてみる

コード表示

SELECT
	*
FROM
	test______ MATCH_RECOGNIZE(
		PARTITION BY user_id
		ORDER BY tp
		MEASURES
			match_number() as session_id
			,CLASSIFIER() AS cls
			,COUNT(*) as evt_cnt
			,FIRST(strt.tp) as start_time
			,LAST(dur.tp) as end_time
			,LAST(dur.tp) - FIRST(strt.tp) as session_duration
   ONE ROW PER MATCH
--   ALL ROWS PER MATCH
   PATTERN (strt dur+)
   DEFINE
       dur as (tp - prev(tp) <= INTERVAL '3' SECOND )
 );

明細行も出力するようにしてみる

コード表示

SELECT
	*
FROM
	test______ MATCH_RECOGNIZE(
		PARTITION BY user_id
		ORDER BY tp
		MEASURES
			match_number() as session_id
			,CLASSIFIER() AS cls
			,COUNT(*) as evt_cnt
			,FIRST(strt.tp) as start_time
			,LAST(dur.tp) as end_time
			,LAST(dur.tp) - FIRST(strt.tp) as session_duration
--   ONE ROW PER MATCH
   ALL ROWS PER MATCH
   PATTERN (strt dur+)
   DEFINE
       dur as (tp - prev(tp) <= INTERVAL '3' SECOND )
 );

コード表示

USER_ID                  ITEM                               QTY TP
------------------------ ------------------------ ------------- ---------------------------------------------------------------------------
user01                   itemCC                               9 2019-02-23 01:14:09←★開始行
user01                   itemAA                               9 2019-02-23 01:14:18←★一行前と比べて3秒以内のアクセスではないため、開始行
user01                   itemAA                               6 2019-02-23 01:14:19←★一行前と比べて3秒以内のアクセスであるため、連続行
user01                   itemAA                               4 2019-02-23 01:14:24←★一行前と比べて3秒以内のアクセスではないため、開始行
user01                   itemCC                               6 2019-02-23 01:14:32←★一行前と比べて3秒以内のアクセスではないため、開始行

コード表示

USER_ID                  ITEM                               QTY TP
------------------------ ------------------------ ------------- ---------------------------------------------------------------------------
user04                   itemBB                               4 2019-02-23 01:14:13←★開始行
user04                   itemDD                               4 2019-02-23 01:14:14←★一行前と比べて3秒以内のアクセスであるため、連続行
user04                   itemBB                               2 2019-02-23 01:14:20←★一行前と比べて3秒以内のアクセスではないため、開始行
user04                   itemDD                               7 2019-02-23 01:14:28←★一行前と比べて3秒以内のアクセスではないため、開始行
user04                   itemDD                               7 2019-02-23 01:14:29←★一行前と比べて3秒以内のアクセスであるため、連続行

partition by で区切ったあと、order byで並べて、パタンマッチを先頭行から開始する。COUNT(*)の振る舞いとしてはパタンマッチした範囲で通番を振ってくれる。ここからデフォルトはrunningキーワードをつけたときと同じことがわかる。今回の場合はPATTERN (strt dur+)を1パタンとして認識した場合のこと。パタンごとの回数をみたいときはパタン変数.カラム名を引数に指定して挙動を調べる。

コード表示

SELECT
	*
FROM
	test______ MATCH_RECOGNIZE(
		PARTITION BY user_id
		ORDER BY tp
		MEASURES
			match_number() as session_id
			,CLASSIFIER() AS cls
			,COUNT(*) as evt_cnt
			,COUNT(strt.tp) as evt__cnt
			,COUNT(dur.tp) as evt___cnt
			,FIRST(strt.tp) as start_time
			,LAST(dur.tp) as end_time
			,LAST(dur.tp) - FIRST(strt.tp) as session_duration
   ONE ROW PER MATCH
--   ALL ROWS PER MATCH
   PATTERN (strt dur+)
   DEFINE
       dur as (tp - prev(tp) <= INTERVAL '3' SECOND )
 );

コード表示

SELECT
	*
FROM
	test______ MATCH_RECOGNIZE(
		PARTITION BY user_id
		ORDER BY tp
		MEASURES
			match_number() as session_id
			,CLASSIFIER() AS cls
			,COUNT(*) as evt_cnt
			,COUNT(strt.tp) as evt__cnt
			,COUNT(dur.tp) as evt___cnt
			,FIRST(strt.tp) as start_time
			,LAST(dur.tp) as end_time
			,LAST(dur.tp) - FIRST(strt.tp) as session_duration
--   ONE ROW PER MATCH
   ALL ROWS PER MATCH
   PATTERN (strt dur+)
   DEFINE
       dur as (tp - prev(tp) <= INTERVAL '3' SECOND )
 );

分析関数でも少し掘り下げてみる。cum列の値が連続している行がONE ROW PER MATCHで出力されている行。開始時刻と終了時刻は連続行をサマレば、出力できる。

コード表示

with tmp as (
select
	s1.*
	,case
		when s1.tp - lag(s1.tp) over (partition by user_id order by s1.tp) <= interval '3' second then 0
		else 1
	end as session_start_flg
from
	TEST______ s1
)
select
	s1.*
	,sum(s1.session_start_flg) over (partition by user_id order by tp rows between unbounded preceding and current row) as cum
from
	tmp s1
;

セッションをグラフ化してみる

以下のsqlでEXCELに食わせるデータを作成。nls_timestamp_tz_format型を使っていると、インクリうまくいかなくてあーだーこーだなったので、sysdateでうまくいったやつをあとでのせた。こいつは汎用性ナイ。nls_date_formatのほうがいいね。これはたまたまうまくいった。

コード表示

WITH cal AS (
	SELECT
		tp
	FROM
		test______
),tmp AS (
	SELECT
		s1.*
		,s0.*
	FROM
		cal s1
		LEFT OUTER JOIN 
		test______ MATCH_RECOGNIZE(
			PARTITION BY user_id
			ORDER BY tp
			MEASURES
				match_number() AS session_id
				,classifier() AS cls
				,COUNT(*) AS evt_cnt
				,COUNT(strt.tp) AS evt__cnt
				,COUNT(dur.tp) AS evt___cnt
				,FIRST(strt.tp) AS start_time
				,LAST(dur.tp) AS end_time
				,LAST(dur.tp) - FIRST(strt.tp) AS session_duration
	   ONE ROW PER MATCH
	--   ALL ROWS PER MATCH
	   PATTERN (strt dur+)
	   DEFINE
		   dur AS (tp - prev(tp) <= INTERVAL '3' SECOND )
	)s0 
		ON
			s1.tp BETWEEN s0.start_time AND s0.end_time
)
SELECT
	s1.tp
	, nvl(s1.user01,0) AS user01
	, nvl(s1.user02,0) AS user02
	, nvl(s1.user03,0) AS user03
	, nvl(s1.user04,0) AS user04
	, nvl(s1.user05,0) AS user05
	, nvl(s1.user06,0) AS user06
FROM
	tmp s0
	PIVOT(MIN(s0.evt__cnt) FOR user_id IN ('user01' AS user01,'user02' AS user02,'user03' AS user03,'user04' AS user04,'user05' AS user05,'user06' AS user06)) s1
;

取得できたデータ

グラフにした。本当はパルスで書きたいけど、いろいろねばってもメンテめんどくさすぎて妥協。。簡単にできるやつないかな。エクセルで。時間幅細かくするにもデータ増幅複写しないと実現できないからぜんぜんハンディじゃない。手軽さがほしい。まあ、雰囲気わかるからいっか。いったん。

セッションに対してパタンマッチしてみる

URLへのアクセスをシグネチャに置き換えることでパタンマッチしやすくする。マスタデータみたいなものを用意しておく。

コード表示

DROP TABLE url_sig PURGE;
CREATE TABLE url_sig AS
WITH URL AS (
SELECT
	'/' AS req_path
	,'T' AS sig
	,'top page' AS desp
FROM
	dual
UNION ALL
SELECT
	'search' AS req_path
	,'S' AS sig
	,'item search' AS desp
FROM
	dual
UNION ALL
SELECT
	'items' AS req_path
	,'L' AS sig
	,'item list' AS desp
FROM
	dual
UNION ALL
SELECT
	'items/detail' AS req_path
	,'ID' AS sig
	,'item detail' AS desp
FROM
	dual
UNION ALL
SELECT
	'cart' AS req_path
	,'C' AS sig
	,'cart list' AS desp
FROM
	dual
UNION ALL
SELECT
	'cart/add' AS req_path
	,'CA' AS sig
	,'add item into cart' AS desp
FROM
	dual
UNION ALL
SELECT
	'cart/remove' AS req_path
	,'CR' AS sig
	,'remove item from cart' AS desp
FROM
	dual
UNION ALL
SELECT
	'cart/chkout' AS req_path
	,'CO' AS sig
	,'checkout' AS desp
FROM
	dual
)
SELECT
	*
FROM
	URL
;
コード表示

AINE@pdb1> select * from url_sig;

REQ_PATH      SI DESP
------------- -- ---------------------
/             T  top page
/search       S  item search
/items        L  item list
/items/detail ID item detail
/cart         C  cart list
/cart/add     CA add item into cart
/cart/remove  CR remove item from cart
/cart/chkout  CO checkout

8 rows selected.

Elapsed: 00:00:00.02

セッションに対してパタンマッチしてみるためのトランデータつくる

以下のSQLで生成。

コード表示

DROP TABLE test_________ PURGE;

CREATE TABLE test_________ AS 
WITH sub AS (
SELECT
	rownum as rn
	,tp
FROM(
	SELECT
		DISTINCT tp
	FROM
		dual
	MODEL
		DIMENSION BY ( 1 AS rn )
		MEASURES ( sysdate AS tp )
		RULES ITERATE(100)
		( tp[iteration_number] = sysdate - TRUNC(ABS(dbms_random.VALUE(1, 100)), 0)/86400 )
	)tmp
	WHERE
		ROWNUM <= 30
)
SELECT
	rpad('user0',6,decode(MOD(ROWNUM,6),0,6,MOD(ROWNUM,6))) AS user_id
	,s1.tp
	,s2.req_path
FROM
	sub s1
		inner join (SELECT ROWNUM AS rn ,s1.req_path FROM url_sig s1 ORDER BY dbms_random.random()) s2
			on
				decode(mod(s1.rn,8),0,8,mod(s1.rn,8)) = s2.rn
ORDER BY
	1,2
;

SELECT * FROM test_________ ORDER BY user_id,tp;
コード表示

[oracle@f285aba0589a ~]$ sqlplus aine/ORACLE_PWD@pdb1

SQL*Plus: Release 18.0.0.0.0 - Production on Sat Feb 23 20:42:04 2019
Version 18.3.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.

Last Successful login time: Sat Feb 23 2019 19:43:30 +09:00

Connected to:
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
Version 18.3.0.0.0

AINE@pdb1> ALTER SESSION SET nls_timestamp_tz_format = 'YYYY-MM-DD HH24:MI:SS';

Session altered.

Elapsed: 00:00:00.00
AINE@pdb1> SELECT * FROM test_________ ORDER BY user_id,tp;

USER_ID                  TP                                                                          REQ_PATH
------------------------ --------------------------------------------------------------------------- -------------
user01                   2019-02-23 20:39:50                                                         /
user01                   2019-02-23 20:39:52                                                         /items
user01                   2019-02-23 20:39:58                                                         /cart/remove
user01                   2019-02-23 20:40:22                                                         /cart
user01                   2019-02-23 20:40:27                                                         /cart/add
user02                   2019-02-23 20:40:04                                                         /
user02                   2019-02-23 20:40:07                                                         /items
user02                   2019-02-23 20:40:15                                                         /cart/remove
user02                   2019-02-23 20:40:36                                                         /cart
user02                   2019-02-23 20:40:37                                                         /cart/add
user03                   2019-02-23 20:39:51                                                         /search
user03                   2019-02-23 20:39:54                                                         /items/detail
user03                   2019-02-23 20:40:17                                                         /
user03                   2019-02-23 20:40:20                                                         /items
user03                   2019-02-23 20:40:28                                                         /cart/remove
user04                   2019-02-23 20:40:00                                                         /cart/chkout
user04                   2019-02-23 20:40:06                                                         /search
user04                   2019-02-23 20:40:08                                                         /items/detail
user04                   2019-02-23 20:40:32                                                         /
user04                   2019-02-23 20:40:34                                                         /items
user05                   2019-02-23 20:39:55                                                         /cart
user05                   2019-02-23 20:39:57                                                         /cart/add
user05                   2019-02-23 20:40:16                                                         /cart/chkout
user05                   2019-02-23 20:40:18                                                         /search
user05                   2019-02-23 20:40:21                                                         /items/detail
user06                   2019-02-23 20:40:09                                                         /cart
user06                   2019-02-23 20:40:11                                                         /cart/add
user06                   2019-02-23 20:40:29                                                         /cart/chkout
user06                   2019-02-23 20:40:33                                                         /search
user06                   2019-02-23 20:40:35                                                         /items/detail

30 rows selected.

Elapsed: 00:00:00.00

サマリ行出力してみる

コード表示

alter session set nls_date_format='YYYY/MM/DD HH24:MI:SS';
SELECT
	*
FROM
	test_________ MATCH_RECOGNIZE (
		PARTITION BY user_id
		ORDER BY tp
		MEASURES
			match_number() AS session_id
		, classifier() AS cls
		, COUNT(*) AS evt_cnt
		, COUNT(strt.tp) AS evt__cnt
		, COUNT(dur.tp) AS evt___cnt
		, FIRST(strt.tp) AS start_time
		, LAST(dur.tp) AS end_time
		, round( (LAST(dur.tp) - FIRST(strt.tp) ) * 24 * 60 * 60, 0) AS session_duration
		ONE ROW PER MATCH
	--   ALL ROWS PER MATCH
	PATTERN ( strt dur +) DEFINE
		dur AS ( round( (tp - prev(tp) ) * 24 * 60 * 60, 0) <= 3 )
	);

グラフ化してみる

以下のSQLで生成。

コード表示

alter session set nls_date_format='YYYY/MM/DD HH24:MI:SS';
WITH cal AS (
		SELECT
			 A.date_from + (ROWNUM - 1)/86400 AS tp
		FROM
			(
				SELECT
					MIN(tp) AS date_from
					,MAX(tp) AS date_to
					,round((MAX(tp) - MIN(tp))*24*60*60,0) AS diff
				FROM
					test_________
			) A
		CONNECT BY
			LEVEL <= A.diff + 1
),tmp AS (
	SELECT
		s1.*
		,s0.*
	FROM
		cal s1
		LEFT OUTER JOIN 
		test_________ MATCH_RECOGNIZE(
			PARTITION BY user_id
			ORDER BY tp
			MEASURES
				match_number() AS session_id
				,classifier() AS cls
				,COUNT(*) AS evt_cnt
				,COUNT(strt.tp) AS evt__cnt
				,COUNT(dur.tp) AS evt___cnt
				,FIRST(strt.tp) AS start_time
				,LAST(dur.tp) AS end_time
				,round((LAST(dur.tp) - FIRST(strt.tp))*24*60*60,0) AS session_duration
	   ONE ROW PER MATCH
	--   ALL ROWS PER MATCH
	   PATTERN (strt dur+)
	   DEFINE
		   dur AS (round((tp - prev(tp))*24*60*60,0) <= 3)
	)s0 
		ON
			s1.tp BETWEEN s0.start_time AND s0.end_time
)
SELECT
	s1.tp
	, nvl(s1.user01,0) AS user01
	, nvl(s1.user02,0) AS user02
	, nvl(s1.user03,0) AS user03
	, nvl(s1.user04,0) AS user04
	, nvl(s1.user05,0) AS user05
	, nvl(s1.user06,0) AS user06
FROM
	tmp s0
	PIVOT(MIN(s0.evt__cnt) FOR user_id IN ('user01' AS user01,'user02' AS user02,'user03' AS user03,'user04' AS user04,'user05' AS user05,'user06' AS user06)) s1
ORDER BY
	s1.tp
;
コード表示

TP                         USER01        USER02        USER03        USER04        USER05        USER06
------------------- ------------- ------------- ------------- ------------- ------------- -------------
2019/02/23 21:14:58             0             0             1             0             0             0
2019/02/23 21:14:59             0             0             1             0             0             0
2019/02/23 21:15:00             0             0             0             0             0             0
2019/02/23 21:15:01             0             0             0             0             0             0
2019/02/23 21:15:02             0             0             0             0             0             0
2019/02/23 21:15:03             1             0             0             0             0             0
2019/02/23 21:15:04             1             0             0             0             0             0
2019/02/23 21:15:05             1             0             0             0             0             0
2019/02/23 21:15:06             0             0             0             0             0             0
2019/02/23 21:15:07             0             0             0             0             0             0
2019/02/23 21:15:08             0             0             0             0             0             0
2019/02/23 21:15:09             0             0             0             0             0             0
2019/02/23 21:15:10             0             0             0             1             0             0
2019/02/23 21:15:11             0             0             0             1             0             0
2019/02/23 21:15:12             0             0             0             1             0             0
2019/02/23 21:15:13             0             0             0             1             0             0
2019/02/23 21:15:14             0             0             0             0             0             0
2019/02/23 21:15:15             0             0             0             0             0             0
2019/02/23 21:15:16             0             0             0             0             0             0
2019/02/23 21:15:17             0             1             0             0             0             0
2019/02/23 21:15:18             0             1             0             0             0             0
2019/02/23 21:15:19             0             1             0             0             0             0
2019/02/23 21:15:20             0             0             0             0             0             0
2019/02/23 21:15:21             0             0             0             0             0             0
2019/02/23 21:15:22             0             0             0             0             1             0
2019/02/23 21:15:23             0             0             0             0             1             0
2019/02/23 21:15:24             0             0             0             0             0             0
2019/02/23 21:15:25             0             0             0             0             0             0
2019/02/23 21:15:26             0             0             0             0             0             0
2019/02/23 21:15:27             0             0             0             0             0             0
2019/02/23 21:15:28             0             0             1             0             0             0
2019/02/23 21:15:29             0             0             1             0             0             0
2019/02/23 21:15:30             0             0             1             0             0             0
2019/02/23 21:15:31             0             0             1             0             0             0
2019/02/23 21:15:32             0             0             0             0             0             0
2019/02/23 21:15:33             0             0             0             0             0             0
2019/02/23 21:15:34             0             0             0             0             0             1
2019/02/23 21:15:35             0             0             0             0             0             1
2019/02/23 21:15:36             0             0             0             0             0             0
2019/02/23 21:15:37             0             0             0             0             0             0
2019/02/23 21:15:38             0             0             0             0             0             0
2019/02/23 21:15:39             0             0             0             0             0             0
2019/02/23 21:15:40             0             0             0             0             0             0

43 rows selected.

Elapsed: 00:00:00.01

セッションアクセスパタンを検知してみる

コンバージョン率を求めるときに使える。

コンバージョン率(成約率、コンバージョンレート、CVRともいう)とは、Webサイトの目標に達した数を、目標に達する最初の段階に入った数で割った割合のこと。 インターネット広告やECサイトで、効率を計るために用いる。

コード表示

WITH tmp AS (
SELECT
	s1.user_id
	,s1.session_id
	,MIN(s1.tp) AS start_tp
	,MAX(s1.tp) AS end_tp
	,LISTAGG(s0.sig,'') WITHIN GROUP (ORDER BY s1.tp) AS sig
FROM
	test_________ s1
	MATCH_RECOGNIZE (
		PARTITION BY user_id
		ORDER BY tp
		MEASURES
			match_number() AS session_id
		, classifier() AS cls
		, COUNT(*) AS evt_cnt
		, COUNT(strt.tp) AS evt__cnt
		, COUNT(dur.tp) AS evt___cnt
		, FIRST(strt.tp) AS start_time
		, LAST(dur.tp) AS end_time
		, round( (LAST(dur.tp) - FIRST(strt.tp) ) * 24 * 60 * 60, 0) AS session_duration
--		ONE ROW PER MATCH
	   ALL ROWS PER MATCH
	PATTERN ( strt dur +) DEFINE
		dur AS ( round( (tp - prev(tp) ) * 24 * 60 * 60, 0) <= 3 )
	)s1
	INNER JOIN url_sig s0
		ON
			s1.req_path = s0.req_path
GROUP BY
	s1.user_id
	,s1.session_id
)
SELECT
	s0.*
	,SUM(CASE WHEN REGEXP_LIKE(s0.sig,'T+S+(CA)*(CO)*','i') THEN 1 ELSE 0 END) OVER (PARTITION BY s0.user_id,s0.session_id) AS flg
	,SUM(CASE WHEN REGEXP_LIKE(s0.sig,'T+S+(CA)*(CO)*','i') THEN 1 ELSE 0 END) OVER (PARTITION BY s0.user_id,s0.session_id)
	/ COUNT(*) OVER (PARTITION BY s0.user_id)
	AS cvr
FROM
	tmp s0
;

ことおわり

始める前より理解はすこし深まったとおもう。長くなったので、いったんここで区切る。

Leave a Reply

Your email address will not be published. Required fields are marked *