I’m trying to use Spark NLP to detect English-language texts in a large dataset. The notebook (.ipynb) runs on Google Colab.
Everything works fine, from reading the CSV to fitting the pipeline and selecting the results. It looks like this:
(result.select(F.explode(F.arrays_zip(result.document.result,
                                      result.language.result)).alias("cols"))
       .select(F.expr("cols['0']").alias("Document"),
               F.expr("cols['1']").alias("Language"))
       .show(truncate=100))
+----------------------------------------------------------------------------------------------------+--------+
| Document|Language|
+----------------------------------------------------------------------------------------------------+--------+
|#Elecciones2020 | En #Florida: #JoeBiden dice que #DonaldTrump solo se preocupa por él mismo. El ...| es|
| | null|
| | null|
| | null|
| 0.0| Unknown|
| | null|
| | null|
| 1860.0| Unknown|
|#Trump: As a student I used to hear for years, for ten years, I heard China! In 2019! And we have...| en|
| 1.0| Unknown|
| 2 hours since last tweet from #Trump! Maybe he is VERY busy. Tremendously busy.| en|
|@CLady62 Her 15 minutes were over long time ago. Omarosa never represented the black community! #...| en|
| 0.0| Unknown|
| @richardmarx Glad u got out of the house! DICK!!#trump 2020????????????????????????| en|
|@DeeviousDenise @realDonaldTrump @nypost There won’t be many of them. Unless you all have been v...| en|
|One of the single most effective remedies to eradicate another round of #Trump Plague in our #Whi...| en|
| #Election2020 #Trump | en|
| 0.0| Unknown|
+----------------------------------------------------------------------------------------------------+--------+
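For context on what the select above produces: arrays_zip pairs the document and language result arrays element-wise, and explode flattens each pair into its own row. A minimal plain-Python sketch of that semantics (the sample data here is illustrative, not from my dataset):

```python
# Illustrative per-row annotation results, mimicking Spark NLP's output arrays.
doc_results = ["#Election2020 #Trump", "0.0"]
lang_results = ["en", "Unknown"]

# F.arrays_zip: pair the two arrays element-wise into structs.
zipped = list(zip(doc_results, lang_results))

# F.explode: each zipped pair becomes its own (Document, Language) row.
rows = [{"Document": d, "Language": l} for d, l in zipped]
print(rows)
```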
But then I need to filter the English texts. I tried select(F.expr("cols['1']")).distinct().count(), filter(F.expr("cols['1']") == 'en'), and even a list comprehension after collect(), but none of them returned anything. After waiting an hour, I interrupted the process, and Spark seemed to stop responding, e.g., not even returning results for read.csv().
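For reference, the list-comprehension attempt amounted to something like this (a sketch assuming the exploded rows have already been collected to the driver as (Document, Language) pairs; the sample data is made up):

```python
# Sample of collected (Document, Language) rows, mirroring the table above.
collected = [
    ("#Elecciones2020 | En #Florida ...", "es"),
    ("", None),
    ("0.0", "Unknown"),
    ("2 hours since last tweet from #Trump!", "en"),
]

# Keep only the rows whose detected language is English.
english_texts = [doc for doc, lang in collected if lang == "en"]
print(english_texts)
```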
Why is this happening, and how can I solve it?