Why did Spark stop responding after calling pipeline fit and a distinct count?


I’m trying to use Spark NLP to detect English-language texts in a large dataset. The notebook (ipynb) runs on Google Colab.
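
For context, the pipeline is set up roughly like this; the pretrained model name and the CSV path are placeholders from memory, so the exact values may differ:

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import LanguageDetectorDL

spark = sparknlp.start()

# placeholder path; the real dataset is a large CSV of tweets
df = spark.read.csv("tweets.csv", header=True)

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# pretrained multilingual language detector (model name from memory)
language_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("language")

pipeline = Pipeline(stages=[document_assembler, language_detector])
result = pipeline.fit(df).transform(df)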

Everything works fine, from reading the CSV through pipeline fit to selecting the results. It looks like this:

import pyspark.sql.functions as F

(result.select(F.explode(F.arrays_zip(result.document.result,
                                      result.language.result)).alias("cols"))
       .select(F.expr("cols['0']").alias("Document"),
               F.expr("cols['1']").alias("Language"))
       .show(truncate=100))


+----------------------------------------------------------------------------------------------------+--------+
|                                                                                            Document|Language|
+----------------------------------------------------------------------------------------------------+--------+
|#Elecciones2020 | En #Florida: #JoeBiden dice que #DonaldTrump solo se preocupa por él mismo. El ...|      es|
|                                                                                                    |    null|
|                                                                                                    |    null|
|                                                                                                    |    null|
|                                                                                                 0.0| Unknown|
|                                                                                                    |    null|
|                                                                                                    |    null|
|                                                                                              1860.0| Unknown|
|#Trump: As a student I used to hear for years, for ten years, I heard China! In 2019! And we have...|      en|
|                                                                                                 1.0| Unknown|
|                     2 hours since last tweet from #Trump! Maybe he is VERY busy. Tremendously busy.|      en|
|@CLady62 Her 15 minutes were over long time ago. Omarosa never represented the black community! #...|      en|
|                                                                                                 0.0| Unknown|
|                             @richardmarx Glad u got out of the house! DICK!!#trump 2020????????????????????????|      en|
|@DeeviousDenise @realDonaldTrump @nypost There won’t be many of them.  Unless you all have been v...|      en|
|One of the single most effective remedies to eradicate another round of #Trump Plague in our #Whi...|      en|
|                                                                               #Election2020 #Trump |      en|
|                                                                                                 0.0| Unknown|
+----------------------------------------------------------------------------------------------------+--------+

But then I need to keep only the English texts. I tried select(F.expr("cols['1']")).distinct().count(), filter(F.expr("cols['1']") == 'en'), and even a list comprehension after collect(), but none of them ever returned. After waiting an hour I interrupted the cell, and from then on Spark seemed to stop responding altogether; for example, even spark.read.csv() no longer returned results.
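
Written out, the attempts look roughly like this (cols here is just my own name for the exploded frame):

cols = result.select(F.explode(F.arrays_zip(result.document.result,
                                            result.language.result)).alias("cols"))

# attempt 1: count the distinct language labels -- never returns
cols.select(F.expr("cols['1']").alias("Language")).distinct().count()

# attempt 2: keep only the rows detected as English -- also hangs
cols.filter(F.expr("cols['1']") == 'en').show()

# attempt 3: collect to the driver and filter in Python -- hangs as well
en_rows = [r for r in cols.collect() if r["cols"]["1"] == "en"]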

Why is this happening, and how can I fix it?
