pdfplumber strange behaviour: two PDFs that look the same, but one works and one doesn’t


I have two PDF files that look the same, and I want to extract data from them with this function:

import pandas as pd
import pdfplumber

def extract_data(pdf_path):  # name assumed: the def line was not included in the post
    all_data = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table()
            if table:
                filtered_table = table[5:]  # Skip the heading rows
                # clean_text is a helper defined elsewhere in the script
                header = [clean_text(h) for h in filtered_table[0]]  # Clean the header
                data = filtered_table[1:]

                # Remove empty rows from the table
                data = [row for row in data if any(cell and cell.strip() for cell in row)]

                # Append this page's data to the list
                all_data.extend(data)

    if not all_data:
        return

    df = pd.DataFrame(all_data, columns=header)
    df.dropna(how='all', inplace=True)
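
To see what the per-page logic does independently of pdfplumber, here is a minimal sketch that runs the same slicing and empty-row filter on a hand-written table. The name split_table and the sample rows are made up for illustration; the sample only imitates the list-of-rows shape (cells are strings or None) that extract_table() returns.

```python
def split_table(table, skip=5):
    """Return (header, data) the way the loop does for a single page."""
    filtered = table[skip:]          # drop the rows above the header
    header = filtered[0]
    data = [row for row in filtered[1:]
            if any(cell and cell.strip() for cell in row)]  # drop empty rows
    return header, data

sample = [
    ["Title", None], [None, None], ["Date", "2024"], [None, None], ["", ""],  # 5 preamble rows
    ["Name", "Value"],       # header row
    ["a", "1"],
    [None, "  "],            # empty row: None and whitespace only
    ["b", None],
]
header, data = split_table(sample)
print(header)  # ['Name', 'Value']
print(data)    # [['a', '1'], ['b', None]]
```

Because the loop applies table[5:] to every page, a page whose table does not start with the same five rows would lose data rows and end up with a wrong header. Whether that applies to these PDFs cannot be told from the post.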

When I try to extract data, this code works for only one of the PDFs. I opened both PDFs in Visual Studio Code, and their first lines are not the same.

The one that works:

%PDF-1.7
%����
7 0 obj
<<
/Type /XObject
/Subtype /Image
/Width 242
/Height 43
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Interpolate false
/Filter /FlateDecode
/Length 8514
>>
stream

The one that doesn’t work:

%PDF-1.7
%����
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(en) /StructTreeRoot 61 0 R/MarkInfo<</Marked true>>/Metadata 590 0 R/ViewerPreferences 591 0 R>>
endobj
2 0 obj
<</Type/Pages/Count 12/Kids[ 4 0 R 25 0 R 30 0 R 33 0 R 37 0 R 40 0 R 43 0 R 46 0 R 49 0 R 52 0 R 55 0 R 58 0 R] >>
endobj
3 0 obj
<</Title(��Bando 2   CU Allegato B - Piano delle installazioni dettagliato) /Author(GSE) /CreationDate(D:20240904152507+00'00') /ModDate(D:20240904152507+00'00') /Producer() /Creator() >>
endobj
4 0 obj
<</Type/Page/Parent 2 0 R/Resources<</XObject<</Image6 6 0 R/Image12 12 0 R/Image15 15 0 R>>/ExtGState<</GS7 7 0 R/GS10 10 0 R>>/Font<</F1 8 0 R/F2 16 0 R/F3 18 0 R/F4 20 0 R>>/Pattern<</P11 11 0 R/P13 13 0 R/P14 14 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 841.92 595.32] /Contents 5 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>
endobj
5 0 obj
<</Filter/FlateDecode/Length 4379>>
stream

I have to write the data to Excel, and when I write the data from the second PDF, a lot of binary-looking characters appear between the values in Excel.
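
One way to check whether the "binary" content is really non-printable characters inside the extracted strings is to look at repr() of a cell and strip such characters before writing to Excel. The sketch below uses only the standard library. The helper name strip_control_chars and the sample string are hypothetical, and it assumes the unwanted content consists of Unicode control or format characters, which the post does not confirm.

```python
import unicodedata

def strip_control_chars(text):
    """Drop characters whose Unicode category starts with 'C'; keep tab and newline."""
    return "".join(
        ch for ch in text
        if ch in "\t\n" or not unicodedata.category(ch).startswith("C")
    )

cell = "Impianto\x00 \x0bA\ufeff1"   # made-up cell with hidden characters
print(repr(cell))
print(repr(strip_control_chars(cell)))  # 'Impianto A1'
```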

Both PDF files were exported from Excel. I know that’s strange, but I need it for a reason.

I need both PDFs to be read correctly.


Update: the real problem is the mapping function. I printed what pdfplumber reads and it is correct, but when the data is written, a lot of binary-looking characters end up in the Excel file.

def apply_mapping(text):
    for key, value in data_mapping.items():
        if key in text:
            return value
    return text

And when I call it:

for col in df.columns:
    df[col] = df[col].apply(lambda x: apply_mapping(clean_text(str(x))) if x is not None else "")
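
For reference, apply_mapping replaces the whole cell with the mapped value as soon as any key occurs anywhere in the text, and the first matching key in dictionary order wins. A small sketch with a made-up data_mapping (the real one is not shown in the post) makes this visible, including the edge case of an empty-string key, which matches every cell:

```python
data_mapping = {"kW": "power", "kWh": "energy"}  # hypothetical mapping

def apply_mapping(text):
    for key, value in data_mapping.items():
        if key in text:
            return value
    return text

results = [apply_mapping(s) for s in ("12 kW", "12 kWh", "Roma")]
# "kW" is a substring of "kWh" and is checked first; "Roma" matches no key
print(results)  # ['power', 'power', 'Roma']

data_mapping[""] = "X"           # an empty key is contained in every string
print(apply_mapping("Roma"))     # X
```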

that doesn’t work

It gives me the following (and the script only fails when this FutureWarning appears):

FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use ser.iloc[pos]
  df[col] = df[col].apply(lambda x: apply_mapping(clean_text(str(x))) if x is not None else "")
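
The warning text only shows the line that called apply, not the statement that actually indexes a Series with an integer. A common way to locate it is to turn the FutureWarning into an exception so that Python prints a full traceback. The sketch below shows the mechanism with the standard warnings module; noisy_clean_text is a stand-in that emits a FutureWarning itself, since the real clean_text is not shown in the post.

```python
import traceback
import warnings

def noisy_clean_text(text):
    # Stand-in for code that triggers a FutureWarning somewhere inside
    warnings.warn("treating keys as positions is deprecated", FutureWarning)
    return text.strip()

caught = None
with warnings.catch_warnings():
    warnings.simplefilter("error", FutureWarning)  # raise instead of printing
    try:
        noisy_clean_text(" abc ")
    except FutureWarning as exc:
        caught = exc
        traceback.print_exc()  # traceback points at the line inside noisy_clean_text

print(type(caught).__name__)  # FutureWarning
```

In the real script the same filter could be placed before the loop over df.columns; the traceback would then show whether the integer lookup happens in clean_text, in apply_mapping, or somewhere else.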
