Relative Content

Tag Archive for text-processing

Finding occurrences of useful words and phrases in strings

I am building an app that analyzes people’s posts by pulling their Tweets and Facebook posts. I need to process all the posts and find useful phrases. By useful I mean any word or phrase (a noun, adjective, or verb) that represents a discrete object or idea; in other words, I am looking for keywords.
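A minimal sketch of one way to get candidate keywords, under loud assumptions: it swaps in plain stop-word filtering and frequency counting rather than the part-of-speech tagging the question hints at, and the `STOP_WORDS` list and the `candidate_keywords` function are hypothetical names invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be far longer.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "is", "it",
    "i", "my", "for", "on", "with", "this", "that",
}


def candidate_keywords(posts, top_n=5):
    """Return the top_n most frequent non-stop-word tokens across all posts."""
    counts = Counter()
    for post in posts:
        # Lowercase the post and treat runs of letters/apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", post.lower())
        # Drop stop words and very short tokens, then count what is left.
        counts.update(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


posts = [
    "I love my new camera",
    "The camera is great for travel",
    "Travel with this camera",
]
print(candidate_keywords(posts, top_n=2))
```

Frequency alone cannot tell nouns from verbs or group multi-word phrases, so this is only a starting point for the kind of filtering described above.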

Domain-specific language for text search/processing?

I work for an organization that does a lot of work with government data. We have a couple of different projects where we’ve abstracted out common text search/manipulation operations into reusable libraries, for things like standardizing the way politicians’ names are displayed (e.g., transforming “MCDONALD, BOB (R-VA)” into “Bob McDonald (R-VA)”), or finding legal citations in text (e.g., finding occurrences of things like “1 U.S.C. 7” in text, determining that it’s a US Code citation, and returning a structure that says it’s referring to section 7 of title 1). These are relatively simple operations, and lots of collaborators in our space would like to use them, but we end up having to pick a language in which to implement each (the former is in Python; the latter, JavaScript), and we freeze out potential consumers/contributors who work in different languages and don’t want to resort to hacks like shelling out to a node process to handle their text. This all seems like a shame because what we’re expressing is so simple, and ought, one would think, to be pretty easy to share.
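To make the two operations concrete, here is a rough sketch of what each might look like as a regular-expression rule. It is not the organization’s actual library: the patterns and the names `normalize_name` and `find_usc_citations` are assumptions made for illustration, and the citation rule assumes the usual reading in which the first number is the title and the second is the section.

```python
import re

# Assumed input shape: "SURNAME, FIRST (P-ST)", e.g. "MCDONALD, BOB (R-VA)".
NAME_PATTERN = re.compile(r"^([A-Z'\-]+),\s+([A-Z .]+?)\s+\(([A-Z])-([A-Z]{2})\)$")

# Assumed citation shape: "<title> U.S.C. <section>", e.g. "1 U.S.C. 7".
USC_PATTERN = re.compile(r"\b(\d+)\s+U\.S\.C\.\s+(\d+[a-z]?)\b")


def _capitalize_surname(surname):
    """Capitalize a surname, with a special case for the 'Mc' prefix."""
    result = surname.capitalize()
    if result.startswith("Mc") and len(result) > 2:
        result = "Mc" + result[2:].capitalize()
    return result


def normalize_name(raw):
    """Turn 'MCDONALD, BOB (R-VA)' into 'Bob McDonald (R-VA)'."""
    match = NAME_PATTERN.match(raw.strip())
    if match is None:
        return raw  # Leave unrecognized input untouched.
    last, first, party, state = match.groups()
    return f"{first.title()} {_capitalize_surname(last)} ({party}-{state})"


def find_usc_citations(text):
    """Return one dict per US Code citation found in the text."""
    return [
        {"type": "usc", "title": m.group(1), "section": m.group(2)}
        for m in USC_PATTERN.finditer(text)
    ]


print(normalize_name("MCDONALD, BOB (R-VA)"))
print(find_usc_citations("See 1 U.S.C. 7 for the definition."))
```

Both rules boil down to a pattern plus a small transformation, which is the part that would be worth expressing once and sharing across languages.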

How does Facebook strip html/apostrophes for XSS but also display it?

I’m not quite sure if this is a question for programmers.se rather than stackoverflow, but here goes. When Facebook (or any other large company) is given something like an apostrophe or HTML, it can strip the input of its malicious intent but still display it properly. My current sanitizing function in PHP just strips those characters or makes them harmless via htmlentities() and the like. So if I wrote an HTML tag, I would want it to be sanitized but also displayed on the website. How do I do this?
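One common approach, sketched here in Python rather than PHP, is to store the input as typed and escape it at output time, so the browser shows the tag as literal text instead of interpreting it. The `render_comment` function is a hypothetical name; `html.escape` is from the Python standard library and plays a role similar to the PHP escaping functions mentioned above. Nothing here is claimed to be how Facebook does it.

```python
import html


def render_comment(user_input):
    """Escape user input so it displays as text instead of running as markup."""
    # quote=True also escapes double and single quotes, which matters when
    # the value ends up inside an HTML attribute.
    return html.escape(user_input, quote=True)


comment = "<script>alert('hi')</script>"
print(render_comment(comment))
```

The escaped string is what gets written into the page; the browser then renders the original characters, apostrophes and angle brackets included, without executing anything.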

Custom Alphabetic Sorting of Array in Java

I have a requirement to read a text file with lines in tag=value format and then output the file with specific tags listed first and the rest sorted alphabetically. The incoming file is randomly sorted with the exception of the first line. The output needs the first two lines to always be the same. For example, given the following tags:

How advanced are author-recognition methods?

If a computer program analyses texts written by an author, how much can it tell today about that author, given texts long enough to be statistically significant?
