AI Agent Development Diary 2025/02/11

February 12, 2025 / February 13, 2025

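Goal

In building a data-store-backed conversational agent with Agent Builder,
continue working on the problem that the answer rate drops as the number of files in the data store increases.

Check whether the answer rate also drops with more files when the data store files are CSV.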

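What I tried
- Using CSV files of about 100 KB each, already uploaded to GCS in the format shown below, together with the code below that imports files into the data store, I imported multiple files to check whether the answer rate drops as the number of files grows.
Result
- The error shown below occurred and the files could not be uploaded.
- The error says that the file has no ID column, but adding an ID column did not make the upload succeed either. As a test, I uploaded the file to GCS manually from the web console, and that upload completed normally. Working through the web console is inconvenient, though, so I will look into how to upload from code.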

CSV file format

answer,question
"所在地: 大阪府, 間取り: ワンルーム, 賃料: 30000円, 管理費: 0円, 敷金: 0円, 礼金: 0円",大阪府でワンルームの物件はありますか？
"所在地: 群馬県, 間取り: 1K, 賃料: 30000円, 管理費: 2000円, 敷金: 0円, 礼金: 0円",で1Kの物件はありますか？

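Files in this shape can be produced with Python's standard `csv` module. The following is a minimal sketch, not the script actually used for these files. The function name `write_qa_csv` is hypothetical. The point it shows is that the answer column contains commas, so it has to be quoted, which `csv.writer` does automatically.

```python
import csv
import io


def write_qa_csv(rows, stream):
    """Write (answer, question) pairs under an 'answer,question' header."""
    writer = csv.writer(stream)  # default quoting: only fields that need it
    writer.writerow(["answer", "question"])
    for answer, question in rows:
        writer.writerow([answer, question])


rows = [
    (
        "所在地: 大阪府, 間取り: ワンルーム, 賃料: 30000円, 管理費: 0円, 敷金: 0円, 礼金: 0円",
        "大阪府でワンルームの物件はありますか？",
    ),
]

buffer = io.StringIO()
write_qa_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the file object as `stream`.
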
Code that uploads files to the data store

from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine, storage


def GCS_import_datastore(project_id, location, data_store_id, gcs_files):
    # Set client options (a regional endpoint is needed unless location is "global")
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    # Create a client
    client = discoveryengine.DocumentServiceClient(client_options=client_options)
    # Set parent path
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )
    # Request to import documents
    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            input_uris=gcs_files,  # list of gs:// URIs of the files to import
            data_schema="content",
        ),
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )
    # Make the request
    operation = client.import_documents(request=request)
    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result(timeout=6800)
    # Handle the response
    print("Response:", response)

Error raised by the code that uploads files to the data store

---------------------------------------------------------------------------
InvalidArgument                           Traceback (most recent call last)
Cell In[19], line 7
      5 for i in range(100):
      6     gcs_files.append(f"gs://{file_path}")
----> 7 GCS_import_datastore(project_id,location,data_store_id,gcs_files)

Cell In[18], line 41, in GCS_import_datastore(project_id, location, data_store_id, gcs_files)
     38 operation = client.import_documents(request=request)
     40 print(f"Waiting for operation to complete: {operation.operation.name}")
---> 41 response = operation.result(timeout=6800)
     43 # Handle the response
     44 print("Response:", response)

File /opt/anaconda3/envs/ab/lib/python3.11/site-packages/google/api_core/future/polling.py:261, in PollingFuture.result(self, timeout, retry, polling)
    256 self._blocking_poll(timeout=timeout, retry=retry, polling=polling)
    258 if self._exception is not None:
    259     # pylint: disable=raising-bad-type
    260     # Pylint doesn't recognize that this is valid in this case.
--> 261     raise self._exception
    263 return self._result

InvalidArgument: 400 Custom Document Id (`_id`) was not found in document. 3: Custom Document Id (`_id`) was not found in document.
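
One thing that may be worth trying for this error is the following sketch. It has not been verified against this data store. It rests on two assumptions: that the `ImportDocumentsRequest` options `data_schema="csv"` and `auto_generate_ids=True` apply to these files, and that the data store accepts CSV records at all. The helper `build_gcs_uris` and its file naming are hypothetical. It is included because the loop in the traceback appends `gs://{file_path}` on every iteration; if `file_path` does not change inside that loop, the same URI would be listed 100 times.

```python
from typing import List


def build_gcs_uris(bucket: str, prefix: str, count: int) -> List[str]:
    """Return `count` distinct URIs like gs://bucket/prefix0.csv (hypothetical naming)."""
    return [f"gs://{bucket}/{prefix}{i}.csv" for i in range(count)]


def import_csv_files(project_id: str, location: str, data_store_id: str,
                     gcs_files: List[str]):
    # Imported lazily so build_gcs_uris works without the client library installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine

    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.DocumentServiceClient(client_options=client_options)
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )
    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            input_uris=gcs_files,
            data_schema="csv",  # assumption: treat each row as a CSV record
        ),
        auto_generate_ids=True,  # assumption: let the service assign document IDs
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )
    operation = client.import_documents(request=request)
    return operation.result(timeout=6800)
```

A call would look like `import_csv_files(project_id, location, data_store_id, build_gcs_uris("my-bucket", "csv/file_", 100))`, where the bucket and prefix are placeholders.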

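Checking whether the answer rate also drops with more files for txt files that contain no Japanese

Goal
- The drop in answer rate as the number of data store files grows may be caused by the data being in Japanese. To test this, I replace the Japanese in the JSON-formatted txt property files with romaji and measure the answer rate.
What I tried
- Used the pykakasi library to convert all Japanese in the existing JSON-formatted txt files to romaji, then uploaded those files to the data store.
- Tested with 100, 400, and 1000 files, in that order.
Result
- Up to 400 files, questions were answered with high probability; from 900 files the answer rate dropped.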
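
The conversion step can be sketched roughly as follows. This is not the script actually used. It assumes pykakasi's `kakasi().convert()` interface, which returns a list of segments with a `"hepburn"` key. The names `pykakasi_converter` and `romanize_json` are hypothetical. The converter is passed in as a function so the JSON traversal can be checked without pykakasi installed. Non-string values such as numbers and null are left untouched.

```python
from typing import Any, Callable


def pykakasi_converter() -> Callable[[str], str]:
    """Build a str -> romaji function backed by pykakasi (imported lazily)."""
    import pykakasi

    kks = pykakasi.kakasi()

    def convert(text: str) -> str:
        # Each segment is a dict; "hepburn" holds the Hepburn romanization.
        return "".join(item["hepburn"] for item in kks.convert(text))

    return convert


def romanize_json(value: Any, convert: Callable[[str], str]) -> Any:
    """Recursively apply `convert` to every string value; keep keys and other types."""
    if isinstance(value, str):
        return convert(value)
    if isinstance(value, list):
        return [romanize_json(item, convert) for item in value]
    if isinstance(value, dict):
        return {key: romanize_json(item, convert) for key, item in value.items()}
    return value  # numbers, booleans, and None pass through unchanged
```

For one file, this would be used as `romanize_json(json.loads(text), pykakasi_converter())`, followed by `json.dumps(..., ensure_ascii=False, indent=2)` to write the result back.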

Data format before replacing Japanese with romaji

[
  {
    "id": "物件ID1",
    "seller": null,
    "title": "物件名1",
    "price": {
      "administration": 0,
      "deposit": 0,
      "gratuity": 0,
      "rent": 23000
    },
    "address": "住所1",
    "years": 36,
    "stories": "5階建",
    "floor": "4階",
    "floor_plan": "ワンルーム",
    "occupation_area": "13.96m2",
    "description": null,
    "publish_id": "公開ID1",
    "dt": "日付1",
    "traffic": "交通情報1",
    "labels": "",
    "property_type": {
      "building": "large_apartment",
      "types": "rent"
    },
    "spec": null
  },
  {
    "id": "物件ID2",
    "seller": null,
    "title": "物件名2",
    "price": {
      "administration": 2000,
      "deposit": 0,
      "gratuity": 0,
      "rent": 29000
    },
    "address": "住所2",
    "years": 27,
    "stories": "2階建",
    "floor": "2階",
    "floor_plan": "1K",
    "occupation_area": "21.6m2",
    "description": null,
    "publish_id": "公開ID2",
    "dt": "日付2",
    "traffic": "交通情報2",
    "labels": "",
    "property_type": {
      "building": "apartment",
      "types": "rent"
    },
    "spec": null
  }
]

File contents after replacing Japanese with romaji

[
  {
    "id": "物件ID1",
    "seller": null,
    "title": "物件名1",
    "price": {
      "administration": "0",
      "deposit": "0",
      "gratuity": "0",
      "rent": "23000"
    },
    "address": "住所1",
    "years": 36,
    "stories": "階数1",
    "floor": "階1",
    "floor_plan": "間取り1",
    "occupation_area": "13.96m2",
    "description": null,
    "publish_id": "公開ID1",
    "dt": "日付1",
    "traffic": "交通情報1",
    "labels": "",
    "property_type": {
      "building": "large_apartment",
      "types": "rent"
    },
    "spec": null
  },
  {
    "id": "物件ID2",
    "seller": null,
    "title": "物件名2",
    "price": {
      "administration": "2000",
      "deposit": "0",
      "gratuity": "0",
      "rent": "29000"
    },
    "address": "住所2",
    "years": 27,
    "stories": "階数2",
    "floor": "階2",
    "floor_plan": "間取り2",
    "occupation_area": "21.6m2",
    "description": null,
    "publish_id": "公開ID2",
    "dt": "日付2",
    "traffic": "交通情報2",
    "labels": "",
    "property_type": {
      "building": "apartment",
      "types": "rent"
    },
    "spec": null
  }
]