@Alex Shuper

雖然在設計階段應該避免讓Jsonb的資料成為查詢條件,但開發的時候仍然難以避免這樣的需求。Postgres針對Jsonb的查詢提供了不少高效的API,如果想要搭配EFCore使用就會需要預先定義型別或者搭配JsonDocument使用。


準備資料

先在本機以docker準備資料庫

docker run --name my-postgres -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres

接下來我們要先準備幾個類別

// 我們來準備一些動物的資料
public class Animal : IDisposable
{
    public Guid Id { get; set; }
    public JsonDocument JsonbData { get; set; }
 
    public void Dispose() => JsonbData.Dispose();
}
 
// 每一種動物都有不同的資料
public record Dog(string Name, bool IsHealthy);
public record Cat(string Name, Guid CategoryId);
public record Bird(string Name);
 
// 另外貓的種類另外拉出來放到一個資料表中
public class CatCategory
{
    public Guid Id { get; set; }
    public string CategoryName { get; set; }
}

可以用下面的C#程式碼來準備測試用的資料,記得更換自己的連線字串:
https://gist.github.com/marvin-hsu/a7ea9bd6d514ccddb094144b298e39b3

Jsonb欄位作為Join條件

在預先準備的資料中,由於我們把貓的種類紀錄到JsonbData中,假設我想要找到前十隻波斯貓的話,查詢語法就會變成:

(await (from animal in context.Animal
            join category in context.CatCategory
                on animal.JsonbData.RootElement.GetProperty(nameof(Cat.CategoryId)).GetGuid() equals category.Id
            where category.CategoryName == "Persian"
            select animal)
        .Take(10)
        .ToListAsync())
    .Select(x=> x.JsonbData.Deserialize<Cat>().Name)
    .ToList()
    .ForEach(Console.WriteLine);

神奇吧!Jsonb中的欄位竟然可以跟一般的欄位Join,我們來看一下轉譯出來的語法:

SELECT animal.Id AS animal_id, animal.JsonbData AS animal_data
FROM Animal AS animal
INNER JOIN CatCategory AS category ON CAST(animal.JsonbData->>'CategoryId' AS uuid) = category.Id
WHERE category.CategoryName = 'Persian'
LIMIT 10

再來就是執行計畫:

Limit  (cost=23.44..73.87 rows=10 width=55) (actual time=0.039..0.076 rows=10 loops=1)
  ->  Hash Join  (cost=23.44..2378.79 rows=467 width=55) (actual time=0.038..0.075 rows=10 loops=1)
"        Hash Cond: (((animal.""JsonbData"" ->> 'CategoryId'::text))::uuid = category.""Id"")"
"        ->  Seq Scan on ""Animal"" animal  (cost=0.00..2054.00 rows=100000 width=55) (actual time=0.006..0.014 rows=146 loops=1)"
        ->  Hash  (cost=23.38..23.38 rows=5 width=16) (actual time=0.007..0.008 rows=1 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
"              ->  Seq Scan on ""CatCategory"" category  (cost=0.00..23.38 rows=5 width=16) (actual time=0.005..0.006 rows=1 loops=1)"
"                    Filter: (""CategoryName"" = 'Persian'::text)"
                    Rows Removed by Filter: 3
Planning Time: 0.107 ms
Execution Time: 0.093 ms

資料很簡單,但這邊可以看到因為是由貓的類型回來尋找動物的名稱,因此不會走到索引。

使用Where查詢

先來個簡單的查詢,我想要找到第一支健康的狗:

await context.Animal
    .FirstOrDefaultAsync(x => x.JsonbData
        .RootElement
        .GetProperty(nameof(Dog.IsHealthy))
        .GetBoolean());
 
// SELECT a."Id", a."JsonbData"
// FROM "Animal" AS a
// WHERE CAST(a."JsonbData"->>'IsHealthy' AS boolean)
// LIMIT 1

現在來介紹一下npgsql提供的幾個函數:

EF.FunctionsSQL
EF.Functions.JsonContains(customer, @"{""Name"": ""Joe"", ""Age"": 25}")customer @> '{"Name": "Joe", "Age": 25}'
EF.Functions.JsonContained(@"{""Name"": ""Joe"", ""Age"": 25}", customer)'{"Name": "Joe", "Age": 25}' <@ customer
EF.Functions.JsonExists(customer, "Age")customer ? 'Age'
EF.Functions.JsonExistsAny(customer, "Age", "Address")customer ? | ARRAY['Age','Address']
EF.Functions.JsonExistsAll(customer, "Age", "Address")customer ?& ARRAY['Age','Address']
EF.Functions.JsonTypeof(customer.GetProperty("Age")) == "number"jsonb_typeof(customer->'Age') = 'number'

這幾個函數的回傳值都是bool,JsonContainsJsonContained都是以Json的型態去找察相關的物件。Exist系列則是查詢Json結構中的key。最後一個JsonTypeof則是以value的型別為條件。
現在使用EF.Function來查詢前十隻波斯貓,這次直接用CategoryId來查詢

await context.Animal
    .Where(x => EF.Functions.JsonContains(x.JsonbData, """{"CategoryId": "b9daf2a3-d022-4414-b83b-2719afb98a85"}"""))
    .Take(10).ToListAsync();
 
// SELECT a."Id", a."JsonbData"
// FROM "Animal" AS a
// WHERE a."JsonbData" @> '{"CategoryId": "b9daf2a3-d022-4414-b83b-2719afb98a85"}'
// LIMIT @__p_1

這兩個查詢比較一下可以發現直接使用GetProperty轉譯出來的語法會多一次轉型,而使用EF.Function則是直接比對資料的型態。在平常的查詢中,如果遇到需要轉換欄位型別的狀況,往往容易造成索引失效,在Jsonb的查詢中也是盡量避免,通常->>之類的語法應該在已經查詢好資料要整理成所需的格式時使用,使用@>比對結構通常會比較高效,如果是像Join這種必須取出欄位的情況下則是需要確保查詢上都是以jsonb這方為約束另一張表索引的條件。

關於幾個EF.Function可以參考這裡

Jsonb索引

在Postgres中其實可以對Jsonb下索引來增進查詢效能,我們來稍微是一下BtreeGin

  • Btree
    這是針對Jsonb中的特定欄位來下索引,我們來針對Name這個欄位增加索引:
    CREATE INDEX animal_name_btree ON "Animal" USING BTREE (("JsonbData"->>'Name'));
    接下來比較兩種查詢方式:
    explain analyse
    select * from "Animal"
    where "JsonbData" ->> 'Name' = 'de7e7';
     
    -- "Bitmap Heap Scan on ""Animal""  (cost=12.17..886.51 rows=500 width=55) (actual time=0.023..0.024 rows=1 loops=1)"
    -- "  Recheck Cond: ((""JsonbData"" ->> 'Name'::text) = 'de7e7'::text)"
    --   Heap Blocks: exact=1
    --   ->  Bitmap Index Scan on animal_name_btree  (cost=0.00..12.04 rows=500 width=0) (actual time=0.017..0.018 rows=1 loops=1)
    -- "        Index Cond: ((""JsonbData"" ->> 'Name'::text) = 'de7e7'::text)"
    -- Planning Time: 0.064 ms
    -- Execution Time: 0.041 ms
     
    explain analyse
    select * from "Animal"
    where "JsonbData" @> '{"Name":"de7e7"}';
     
    -- "Seq Scan on ""Animal""  (cost=0.00..2304.00 rows=10 width=55) (actual time=0.011..20.234 rows=1 loops=1)"
    -- "  Filter: (""JsonbData"" @> '{""Name"": ""de7e7""}'::jsonb)"
    --   Rows Removed by Filter: 99999
    -- Planning Time: 0.053 ms
    -- Execution Time: 20.247 ms
    使用btree可以改善->>的查詢效能,但是有欄位要求彈性並沒有那麼大,從執行計畫中可以看到@>不會走btree index
  • Gin
    Gin index則是對整個欄位做索引,比較適合範圍的查詢
    CREATE INDEX animal_name_gin ON "Animal" USING gin ("JsonbData");
     
    explain analyse
    select * from "Animal"
    where "JsonbData" ->> 'Name' = 'de7e7';
     
    -- "Seq Scan on ""Animal""  (cost=0.00..2554.00 rows=500 width=55) (actual time=0.012..16.196 rows=1 loops=1)"
    -- "  Filter: ((""JsonbData"" ->> 'Name'::text) = 'de7e7'::text)"
    --   Rows Removed by Filter: 99999
    -- Planning Time: 0.110 ms
    -- Execution Time: 16.211 ms
     
    explain analyse
    select * from "Animal"
    where "JsonbData" @> '{"Name":"de7e7"}';
     
    -- "Bitmap Heap Scan on ""Animal""  (cost=28.08..65.28 rows=10 width=55) (actual time=0.043..0.043 rows=1 loops=1)"
    -- "  Recheck Cond: (""JsonbData"" @> '{""Name"": ""de7e7""}'::jsonb)"
    --   Heap Blocks: exact=1
    --   ->  Bitmap Index Scan on animal_name_btree  (cost=0.00..28.07 rows=10 width=0) (actual time=0.038..0.038 rows=1 loops=1)
    -- "        Index Cond: (""JsonbData"" @> '{""Name"": ""de7e7""}'::jsonb)"
    -- Planning Time: 0.114 ms
    -- Execution Time: 0.057 ms
    這邊可以看到這次就換@>走到索引了,‵Gin‵比較適合範圍的查詢,不需要指定結構

接下來看一下這兩個索引的大小

SELECT
    indexname AS index_name,
    pg_size_pretty(pg_total_relation_size(indexname::regclass)) AS total_size
FROM pg_indexes
WHERE tablename = 'Animal' AND indexname in ('animal_name_btree','animal_name_gin');
 
+-----------------+----------+
|index_name       |total_size|
+-----------------+----------+
|animal_name_btree|2184 kB   |
|animal_name_gin  |5608 kB   |
+-----------------+----------+

理所當然的,Gin的大小比較大

因為這次使用的測試資料比較簡單所以看不太出差異,但在專案中測試的時候,沒有下index的查詢大概需要花上12s,而經過btree或是gin優化後查詢都有顯著的提昇(1~2ms左右),不過Gin的大小真的會大非常多!