LangChain LanguageParser does not work with Java


I’m trying to clone a Git repo and parse the files in it. I load the files with the following code:

import os

from git import Repo  # provided by the GitPython package
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import LanguageParser

def get_git_code_documents(git_url: str, git_name: str):
    # Clone the repository only if it is not already on disk.
    if not os.path.exists(git_name):
        Repo.clone_from(git_url, git_name)
    else:
        print("Repo already exists")

    # With no language argument, LanguageParser infers the language of each file.
    loader = GenericLoader.from_filesystem(
        git_name,
        glob="**/*",
        suffixes=[".py", ".md", ".sh", ".java"],
        parser=LanguageParser(),
    )
    documents = loader.load()

    return documents

But I get the following error:

File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 600, in _run_script
    exec(code, module.__dict__)
File "/../LLMs/codebase_openai/app.py", line 56, in <module>
    main()
File "/../LLMs/codebase_openai/app.py", line 29, in main
    git_documents = get_git_code_documents(git_url, git_name)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/../LLMs/codebase_openai/llmUtils.py", line 28, in get_git_code_documents
    documents = loader.load()
                ^^^^^^^^^^^^^
File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/langchain_core/document_loaders/base.py", line 29, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/langchain_community/document_loaders/generic.py", line 116, in lazy_load
    yield from self.blob_parser.lazy_parse(blob)
File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/language/language_parser.py", line 214, in lazy_parse
    if not segmenter.is_valid():
           ^^^^^^^^^^^^^^^^^^^^
File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/language/tree_sitter_segmenter.py", line 30, in is_valid
    language = self.get_language()
               ^^^^^^^^^^^^^^^^^^^
File "/../LLMs/codebase_openai/codebase/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/language/java.py", line 26, in get_language
    return get_language("java")
           ^^^^^^^^^^^^^^^^^^^^
File "tree_sitter_languages/core.pyx", line 14, in tree_sitter_languages.core.get_language

I installed tree-sitter and tree-sitter-languages, but I still get the error.
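
Since the traceback ends inside tree_sitter_languages, it may help to confirm which versions are actually installed in the environment that runs the app. The snippet below is a minimal sketch using only the standard library; the helper name installed_versions is made up for illustration. A version mismatch between the two packages is one possible cause, though nothing in the traceback above confirms it.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration; not part of LangChain or tree-sitter.
    """
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI (note the plural "languages").
for name, version in installed_versions(["tree-sitter", "tree-sitter-languages"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this with the same interpreter that Streamlit uses (the codebase virtual environment in the paths above) avoids checking a different Python installation by mistake.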

Interestingly, the error only occurs when .java is in the suffixes list. If I leave .java out, the code runs fine.
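
To pin down which suffix fails and to capture the final exception message (which is cut off in the traceback above), each suffix can be loaded on its own. This is only a sketch: probe_suffixes and fake_load are hypothetical names, and fake_load is a stand-in that imitates the behaviour described here so the example runs without LangChain installed.

```python
from typing import Callable, Dict, Iterable, List, Optional


def probe_suffixes(
    load: Callable[[str], List], suffixes: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Call load(suffix) for each suffix and record the error text, if any.

    Hypothetical helper for illustration. A value of None means the load succeeded.
    """
    results: Dict[str, Optional[str]] = {}
    for suffix in suffixes:
        try:
            load(suffix)
            results[suffix] = None
        except Exception as exc:  # broad on purpose: we want to see any failure
            results[suffix] = f"{type(exc).__name__}: {exc}"
    return results


def fake_load(suffix: str) -> List:
    """Stand-in loader (assumption): fails only for .java, like the real run."""
    if suffix == ".java":
        raise RuntimeError("simulated parser failure")
    return []


# In the real app, `load` would wrap the loader from the post, for example:
#   lambda s: GenericLoader.from_filesystem(
#       git_name, glob="**/*", suffixes=[s], parser=LanguageParser()
#   ).load()
for suffix, error in probe_suffixes(fake_load, [".py", ".md", ".sh", ".java"]).items():
    print(suffix, "->", error or "ok")
```

With the real loader plugged in, the recorded string for .java would contain the exception type and message, which is the most useful detail to include when asking for help.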

Any suggestions?
